Quasi-Newton methods

This subpackage contains quasi-Newton methods that estimate the Hessian using gradient information.

See also

  • Conjugate gradient - a computationally cheaper alternative to quasi-Newton methods.
  • Second order - "true" second-order methods.
  • Line search - quasi-Newton methods usually require either a line search or a trust region (see the sketch below).
  • Trust region - trust regions can use a Hessian estimated by any of the quasi-Newton methods.
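
All modules in this subpackage are combined through tz.Modular in the same way as the per-class examples further down this page. A minimal sketch pairing a quasi-Newton update with a backtracking line search (any other module from this subpackage can be substituted for BFGS):

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking()
)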

Classes:

  • BFGS

    Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

  • BroydenBad

    Broyden's "bad" Quasi-Newton method.

  • BroydenGood

    Broyden's "good" Quasi-Newton method.

  • DFP

    Davidon–Fletcher–Powell Quasi-Newton method.

  • DNRTR

    Diagonal quasi-newton method.

  • DiagonalBFGS

    Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

  • DiagonalQuasiCauchi

    Diagonal quasi-Cauchy method.

  • DiagonalSR1

    Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

  • DiagonalWeightedQuasiCauchi

    Diagonal weighted quasi-Cauchy method.

  • FletcherVMM

    Fletcher's variable metric Quasi-Newton method.

  • GradientCorrection

    Estimates the gradient at the minimum along the search direction, assuming the function is quadratic.

  • Greenstadt1

    Greenstadt's first Quasi-Newton method.

  • Greenstadt2

    Greenstadt's second Quasi-Newton method.

  • Horisho

    Horisho's variable metric Quasi-Newton method.

  • ICUM

    Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods.

  • LBFGS

    Limited-memory BFGS algorithm. A line search or trust region is recommended.

  • LSR1

    Limited-memory SR1 algorithm. A line search or trust region is recommended.

  • McCormick

    McCormick's Quasi-Newton method.

  • NewDQN

    Diagonal quasi-newton method.

  • NewSSM

    Self-scaling Quasi-Newton method.

  • PSB

    Powell's Symmetric Broyden Quasi-Newton method.

  • Pearson

    Pearson's Quasi-Newton method.

  • ProjectedNewtonRaphson

    Projected Newton Raphson method.

  • SR1

    Symmetric Rank 1. This works best with a trust region.

  • SSVM

    Self-scaling variable metric Quasi-Newton method.

  • ShorR

    Shor’s r-algorithm.

  • ThomasOptimalMethod

    Thomas's "optimal" Quasi-Newton method.

BFGS

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

Note

a line search or a trust region is recommended

Warning

this uses at least O(N^2) memory.

Parameters:

  • init_scale (float | Literal['auto'], default: 'auto' ) –

    initial hessian matrix is set to identity times this.

    "auto" corresponds to a heuristic from Nocedal. Stephen J. Wright. Numerical Optimization p.142-143.

    Defaults to "auto".

  • tol (float, default: 1e-32 ) –

    tolerance on curvature condition. Defaults to 1e-32.

  • ptol (float | None, default: 1e-32 ) –

    skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.

  • restart_interval (int | None | Literal['auto'], default: None ) –

    interval between resetting the hessian approximation.

    "auto" corresponds to number of decision variables + 1.

    None - no resets.

    Defaults to None.

  • beta (float | None, default: None ) –

    momentum on H or B. Defaults to None.

  • update_freq (int, default: 1 ) –

    frequency of updating H or B. Defaults to 1.

  • scale_first (bool, default: False ) –

    whether to downscale the first step, before the Hessian approximation becomes available. Defaults to False.

  • scale_second (bool) –

    whether to downscale second step. Defaults to False.

  • concat_params (bool, default: True ) –

    If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.

  • inner (Chainable | None, default: None ) –

    preconditioning is applied to the output of this module. Defaults to None.

Examples:

BFGS with backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking()
)

BFGS with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.BFGS(inverse=False)),
)
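
The restart and update-frequency options from the parameter list above can be added to either setup; a hedged sketch with illustrative values:

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(restart_interval="auto", update_freq=2),
    tz.m.Backtracking()
)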

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BFGS(_InverseHessianUpdateStrategyDefaults):
    """Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

    Note:
        a line search or a trust region is recommended

    Warning:
        this uses at least O(N^2) memory.

    Args:
        init_scale (float | Literal["auto"], optional):
            initial hessian matrix is set to identity times this.

            "auto" corresponds to a heuristic from Nocedal. Stephen J. Wright. Numerical Optimization p.142-143.

            Defaults to "auto".
        tol (float, optional):
            tolerance on curvature condition. Defaults to 1e-32.
        ptol (float | None, optional):
            skips update if maximum difference between current and previous gradients is less than this, to avoid instability.
            Defaults to 1e-32.
        ptol_restart (bool, optional): whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
        restart_interval (int | None | Literal["auto"], optional):
            interval between resetting the hessian approximation.

            "auto" corresponds to number of decision variables + 1.

            None - no resets.

            Defaults to None.
        beta (float | None, optional): momentum on H or B. Defaults to None.
        update_freq (int, optional): frequency of updating H or B. Defaults to 1.
        scale_first (bool, optional):
            whether to downscale first step before hessian approximation becomes available. Defaults to True.
        scale_second (bool, optional): whether to downscale second step. Defaults to False.
        concat_params (bool, optional):
            If true, all parameters are treated as a single vector.
            If False, the update rule is applied to each parameter separately. Defaults to True.
        inner (Chainable | None, optional): preconditioning is applied to the output of this module. Defaults to None.

    ## Examples:

    BFGS with backtracking line search:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.BFGS(),
        tz.m.Backtracking()
    )
    ```

    BFGS with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LevenbergMarquardt(tz.m.BFGS(inverse=False)),
    )
    ```
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return bfgs_H_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return bfgs_B_(B=B, s=s, y=y, tol=setting['tol'])

BroydenBad

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden's "bad" Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
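
A minimal usage sketch following the note above; it reuses the modular pattern from the BFGS examples, with the strong Wolfe line search mentioned elsewhere on this page as the accurate line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.BroydenBad(),
    tz.m.StrongWolfe()
)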

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BroydenBad(_InverseHessianUpdateStrategyDefaults):
    """Broyden's "bad" Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_bad_H_(H=H, s=s, y=y)
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_bad_B_(B=B, s=s, y=y)

BroydenGood

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden's "good" Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BroydenGood(_InverseHessianUpdateStrategyDefaults):
    """Broyden's "good" Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_good_H_(H=H, s=s, y=y)
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_good_B_(B=B, s=s, y=y)

DFP

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Davidon–Fletcher–Powell Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.
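
A minimal trust-region sketch following the BFGS example; passing inverse=False is assumed to be accepted here in the same way as for BFGS and SR1:

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.DFP(inverse=False)),
)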

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class DFP(_InverseHessianUpdateStrategyDefaults):
    """Davidon–Fletcher–Powell Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return dfp_H_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return dfp_B(B=B, s=s, y=y, tol=setting['tol'])

DNRTR

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Diagonal quasi-newton method.

Reference

Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
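
A minimal construction sketch using the bound parameters from the signature below (the values shown are the defaults), combined with a backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.DNRTR(lb=1e-2, ub=1e5),
    tz.m.Backtracking()
)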

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DNRTR(HessianUpdateStrategy):
    """Diagonal quasi-newton method.

    Reference:
        Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
    """
    def __init__(
        self,
        lb: float = 1e-2,
        ub: float = 1e5,
        init_scale: float | Literal["auto"] = "auto",
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(lb=lb, ub=ub)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=False,
            inner=inner,
        )

    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_wqc_B_(B=B, s=s, y=y)

    def modify_B(self, B, state, setting):
        return _truncate(B, setting['lb'], setting['ub'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalBFGS

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalBFGS(_InverseHessianUpdateStrategyDefaults):
    """Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful."""
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_bfgs_H_(H=H, s=s, y=y, tol=setting['tol'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalQuasiCauchi

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Diagonal quasi-Cauchy method.

Reference

Zhu, M., Nazareth, J. L., & Wolkowicz, H. The quasi-Cauchy relation and diagonal updating. SIAM Journal on Optimization 9.4 (1999): 1192-1204.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalQuasiCauchi(_HessianUpdateStrategyDefaults):
    """Diagonal quasi-cauchi method.

    Reference:
        Zhu M., Nazareth J. L., Wolkowicz H. The quasi-Cauchy relation and diagonal updating //SIAM Journal on Optimization. – 1999. – Т. 9. – №. 4. – С. 1192-1204.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_qc_B_(B=B, s=s, y=y)

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalSR1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalSR1(_InverseHessianUpdateStrategyDefaults):
    """Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful."""
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_sr1_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_sr1_(H=B, s=y, y=s, tol=setting['tol'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalWeightedQuasiCauchi

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Diagonal weighted quasi-Cauchy method.

Reference

Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalWeightedQuasiCauchi(_HessianUpdateStrategyDefaults):
    """Diagonal quasi-cauchi method.

    Reference:
        Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_wqc_B_(B=B, s=s, y=y)

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

FletcherVMM

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Fletcher's variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class FletcherVMM(_InverseHessianUpdateStrategyDefaults):
    """
    Fletcher's variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return fletcher_vmm_H_(H=H, s=s, y=y, tol=setting['tol'])

GradientCorrection

Bases: torchzero.core.transform.Transform

Estimates the gradient at the minimum along the search direction, assuming the function is quadratic.

This can be useful as an inner module for second-order methods with an inexact line search.

Example:

L-BFGS with gradient correction

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(inner=tz.m.GradientCorrection()),
    tz.m.Backtracking()
)

Reference

HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class GradientCorrection(Transform):
    """
    Estimates gradient at minima along search direction assuming function is quadratic.

    This can useful as inner module for second order methods with inexact line search.

    ## Example:
    L-BFGS with gradient correction

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(inner=tz.m.GradientCorrection()),
        tz.m.Backtracking()
    )
    ```

    Reference:
        HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

    """
    def __init__(self):
        super().__init__(None, uses_grad=False)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        if 'p_prev' not in states[0]:
            p_prev = unpack_states(states, tensors, 'p_prev', init=params)
            g_prev = unpack_states(states, tensors, 'g_prev', init=tensors)
            return tensors

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)
        g_hat = gradient_correction(TensorList(tensors), params-p_prev, tensors-g_prev)

        p_prev.copy_(params)
        g_prev.copy_(tensors)
        return g_hat

Greenstadt1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Greenstadt's first Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Greenstadt1(_InverseHessianUpdateStrategyDefaults):
    """Greenstadt's first Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return greenstadt1_H_(H=H, s=s, y=y, g_prev=g_prev)

Greenstadt2

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Greenstadt's second Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Greenstadt2(_InverseHessianUpdateStrategyDefaults):
    """Greenstadt's second Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return greenstadt2_H_(H=H, s=s, y=y)

Horisho

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Horisho's variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Horisho(_InverseHessianUpdateStrategyDefaults):
    """
    Horisho's variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return hoshino_H_(H=H, s=s, y=y, tol=setting['tol'])

ICUM

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods due to only updating one column of the inverse hessian approximation per step.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
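
A minimal usage sketch following the note above, using the backtracking line search from the other examples on this page:

opt = tz.Modular(
    model.parameters(),
    tz.m.ICUM(),
    tz.m.Backtracking()
)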

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ICUM(_InverseHessianUpdateStrategyDefaults):
    """
    Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods
    due to only updating one column of the inverse hessian approximation per step.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return icum_H_(H=H, s=s, y=y)

LBFGS

Bases: torchzero.core.transform.Transform

Limited-memory BFGS algorithm. A line search or trust region is recommended.

Parameters:

  • history_size (int, default: 10 ) –

    number of past parameter differences and gradient differences to store. Defaults to 10.

  • ptol (float | None, default: 1e-32 ) –

    skips updating the history if the maximum absolute value of the parameter difference is less than this value. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    If True, whenever the parameter difference is less than ptol, L-BFGS state will be reset. Defaults to False.

  • gtol (float | None, default: 1e-32 ) –

    skips updating the history if the maximum absolute value of the gradient difference is less than this value. Defaults to 1e-32.

  • gtol_restart (bool, default: False ) –

    If True, whenever the gradient difference is less than gtol, L-BFGS state will be reset. Defaults to False.

  • sy_tol (float | None, default: 1e-32 ) –

    history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)

  • scale_first (bool, default: True ) –

    makes the first step small, when no Hessian approximation is available yet, to reduce the number of line search iterations. Defaults to True.

  • update_freq (int, default: 1 ) –

    how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.

  • damping (DampingStrategyType, default: None ) –

    damping to use, can be "powell" or "double". Defaults to None.

  • inner (Chainable | None, default: None ) –

    optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.

Examples:

L-BFGS with line search

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(100),
    tz.m.Backtracking()
)

L-BFGS with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LBFGS())
)
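
The damping and update_freq options from the parameter list can be added to either setup; a hedged sketch where "powell" is taken from the parameter description and the other values are illustrative:

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(history_size=10, damping="powell", update_freq=4),
    tz.m.Backtracking()
)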

Source code in torchzero/modules/quasi_newton/lbfgs.py
class LBFGS(Transform):
    """Limited-memory BFGS algorithm. A line search or trust region is recommended.

    Args:
        history_size (int, optional):
            number of past parameter differences and gradient differences to store. Defaults to 10.
        ptol (float | None, optional):
            skips updating the history if maximum absolute value of
            parameter difference is less than this value. Defaults to 1e-10.
        ptol_restart (bool, optional):
            If true, whenever parameter difference is less then ``ptol``,
            L-BFGS state will be reset. Defaults to None.
        gtol (float | None, optional):
            skips updating the history if if maximum absolute value of
            gradient difference is less than this value. Defaults to 1e-10.
        ptol_restart (bool, optional):
            If true, whenever gradient difference is less then ``gtol``,
            L-BFGS state will be reset. Defaults to None.
        sy_tol (float | None, optional):
            history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)
        scale_first (bool, optional):
            makes first step, when hessian approximation is not available,
            small to reduce number of line search iterations. Defaults to True.
        update_freq (int, optional):
            how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.
        damping (DampingStrategyType, optional):
            damping to use, can be "powell" or "double". Defaults to None.
        inner (Chainable | None, optional):
            optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.

    ## Examples:

    L-BFGS with line search
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(100),
        tz.m.Backtracking()
    )
    ```

    L-BFGS with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.TrustCG(tz.m.LBFGS())
    )
    ```
    """
    def __init__(
        self,
        history_size=10,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        gtol_restart: bool = False,
        sy_tol: float = 1e-32,
        scale_first:bool=True,
        update_freq = 1,
        damping: DampingStrategyType = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(
            history_size=history_size,
            scale_first=scale_first,
            ptol=ptol,
            gtol=gtol,
            ptol_restart=ptol_restart,
            gtol_restart=gtol_restart,
            sy_tol=sy_tol,
            damping = damping,
        )
        super().__init__(defaults, uses_grad=False, inner=inner, update_freq=update_freq)

        self.global_state['s_history'] = deque(maxlen=history_size)
        self.global_state['y_history'] = deque(maxlen=history_size)
        self.global_state['sy_history'] = deque(maxlen=history_size)

    def _reset_self(self):
        self.state.clear()
        self.global_state['step'] = 0
        self.global_state['s_history'].clear()
        self.global_state['y_history'].clear()
        self.global_state['sy_history'].clear()

    def reset(self):
        self._reset_self()
        for c in self.children.values(): c.reset()

    def reset_for_online(self):
        super().reset_for_online()
        self.clear_state_keys('p_prev', 'g_prev')
        self.global_state.pop('step', None)

    @torch.no_grad
    def update_tensors(self, tensors, params, grads, loss, states, settings):
        p = as_tensorlist(params)
        g = as_tensorlist(tensors)
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        # history of s and k
        s_history: deque[TensorList] = self.global_state['s_history']
        y_history: deque[TensorList] = self.global_state['y_history']
        sy_history: deque[torch.Tensor] = self.global_state['sy_history']

        ptol = self.defaults['ptol']
        gtol = self.defaults['gtol']
        ptol_restart = self.defaults['ptol_restart']
        gtol_restart = self.defaults['gtol_restart']
        sy_tol = self.defaults['sy_tol']
        damping = self.defaults['damping']

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)

        # 1st step - there are no previous params and grads, lbfgs will do normalized SGD step
        if step == 0:
            s = None; y = None; sy = None
        else:
            s = p - p_prev
            y = g - g_prev

            if damping is not None:
                s, y = apply_damping(damping, s=s, y=y, g=g, H=self.get_H())

            sy = s.dot(y)
            # damping to be added here

        below_tol = False
        # tolerance on parameter difference to avoid exploding after converging
        if ptol is not None:
            if s is not None and s.abs().global_max() <= ptol:
                if ptol_restart:
                    self._reset_self()
                sy = None
                below_tol = True

        # tolerance on gradient difference to avoid exploding when there is no curvature
        if gtol is not None:
            if y is not None and y.abs().global_max() <= gtol:
                if gtol_restart: self._reset_self()
                sy = None
                below_tol = True

        # store previous params and grads
        if not below_tol:
            p_prev.copy_(p)
            g_prev.copy_(g)

        # update effective preconditioning state
        if sy is not None and sy > sy_tol:
            assert s is not None and y is not None and sy is not None

            s_history.append(s)
            y_history.append(y)
            sy_history.append(sy)

    def get_H(self, var=...):
        s_history = [tl.to_vec() for tl in self.global_state['s_history']]
        y_history = [tl.to_vec() for tl in self.global_state['y_history']]
        sy_history = self.global_state['sy_history']
        return LBFGSLinearOperator(s_history, y_history, sy_history)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        scale_first = self.defaults['scale_first']

        tensors = as_tensorlist(tensors)

        s_history = self.global_state['s_history']
        y_history = self.global_state['y_history']
        sy_history = self.global_state['sy_history']

        # precondition
        dir = lbfgs_Hx(
            x=tensors,
            s_history=s_history,
            y_history=y_history,
            sy_history=sy_history,
        )

        # scale 1st step
        if scale_first and self.global_state.get('step', 1) == 1:
            dir *= initial_step_size(dir, eps=1e-7)

        return dir

LSR1

Bases: torchzero.core.transform.Transform

Limited-memory SR1 algorithm. A line search or trust region is recommended.

Parameters:

  • history_size (int, default: 10 ) –

    number of past parameter differences and gradient differences to store. Defaults to 10.

  • ptol (float | None, default: None ) –

    skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to None.

  • ptol_restart (bool, default: False ) –

    If True, whenever the parameter difference is less than ptol, L-SR1 state will be reset. Defaults to False.

  • gtol (float | None, default: None ) –

    skips updating the history if the maximum absolute value of the gradient difference is less than this value. Defaults to None.

  • gtol_restart (bool, default: False ) –

    If True, whenever the gradient difference is less than gtol, L-SR1 state will be reset. Defaults to False.

  • scale_first (bool, default: False ) –

    makes the first step small, when no Hessian approximation is available yet, to reduce the number of line search iterations. Defaults to False.

  • update_freq (int, default: 1 ) –

    how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.

  • damping (DampingStrategyType, default: None ) –

    damping to use, can be "powell" or "double". Defaults to None.

  • compact (bool) –

    if True, uses a compact-representation version of L-SR1. It is much faster computationally, but less stable.

  • inner (Chainable | None, default: None ) –

    optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.

Examples:

L-SR1 with line search

opt = tz.Modular(
    model.parameters(),
    tz.m.LSR1(),
    tz.m.StrongWolfe(c2=0.1, fallback=True)
)

L-SR1 with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LSR1())
)

Source code in torchzero/modules/quasi_newton/lsr1.py
class LSR1(Transform):
    """Limited-memory SR1 algorithm. A line search or trust region is recommended.

    Args:
        history_size (int, optional):
            number of past parameter differences and gradient differences to store. Defaults to 10.
        ptol (float | None, optional):
            skips updating the history if maximum absolute value of
            parameter difference is less than this value. Defaults to None.
        ptol_restart (bool, optional):
            If true, whenever parameter difference is less then ``ptol``,
            L-SR1 state will be reset. Defaults to None.
        gtol (float | None, optional):
            skips updating the history if if maximum absolute value of
            gradient difference is less than this value. Defaults to None.
        ptol_restart (bool, optional):
            If true, whenever gradient difference is less then ``gtol``,
            L-SR1 state will be reset. Defaults to None.
        scale_first (bool, optional):
            makes first step, when hessian approximation is not available,
            small to reduce number of line search iterations. Defaults to False.
        update_freq (int, optional):
            how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.
        damping (DampingStrategyType, optional):
            damping to use, can be "powell" or "double". Defaults to None.
        compact (bool, optional):
            if True, uses a compact representation verstion of L-SR1. It is much faster computationally, but less stable.
        inner (Chainable | None, optional):
            optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.

    ## Examples:

    L-SR1 with line search
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.SR1(),
        tz.m.StrongWolfe(c2=0.1, fallback=True)
    )
    ```

    L-SR1 with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.TrustCG(tz.m.LSR1())
    )
    ```
    """
    def __init__(
        self,
        history_size=10,
        ptol: float | None = None,
        ptol_restart: bool = False,
        gtol: float | None = None,
        gtol_restart: bool = False,
        scale_first:bool=False,
        update_freq = 1,
        damping: DampingStrategyType = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(
            history_size=history_size,
            scale_first=scale_first,
            ptol=ptol,
            gtol=gtol,
            ptol_restart=ptol_restart,
            gtol_restart=gtol_restart,
            damping = damping,
        )
        super().__init__(defaults, uses_grad=False, inner=inner, update_freq=update_freq)

        self.global_state['s_history'] = deque(maxlen=history_size)
        self.global_state['y_history'] = deque(maxlen=history_size)

    def _reset_self(self):
        self.state.clear()
        self.global_state['step'] = 0
        self.global_state['s_history'].clear()
        self.global_state['y_history'].clear()

    def reset(self):
        self._reset_self()
        for c in self.children.values(): c.reset()

    def reset_for_online(self):
        super().reset_for_online()
        self.clear_state_keys('p_prev', 'g_prev')
        self.global_state.pop('step', None)

    @torch.no_grad
    def update_tensors(self, tensors, params, grads, loss, states, settings):
        p = as_tensorlist(params)
        g = as_tensorlist(tensors)
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        # history of s and k
        s_history: deque = self.global_state['s_history']
        y_history: deque = self.global_state['y_history']

        ptol = self.defaults['ptol']
        gtol = self.defaults['gtol']
        ptol_restart = self.defaults['ptol_restart']
        gtol_restart = self.defaults['gtol_restart']
        damping = self.defaults['damping']

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)

        # 1st step - there are no previous params and grads, lsr1 will do normalized SGD step
        if step == 0:
            s = None; y = None; sy = None
        else:
            s = p - p_prev
            y = g - g_prev

            if damping is not None:
                s, y = apply_damping(damping, s=s, y=y, g=g, H=self.get_H())

            sy = s.dot(y)
            # damping to be added here

        below_tol = False
        # tolerance on parameter difference to avoid exploding after converging
        if ptol is not None:
            if s is not None and s.abs().global_max() <= ptol:
                if ptol_restart: self._reset_self()
                sy = None
                below_tol = True

        # tolerance on gradient difference to avoid exploding when there is no curvature
        if gtol is not None:
            if y is not None and y.abs().global_max() <= gtol:
                if gtol_restart: self._reset_self()
                sy = None
                below_tol = True

        # store previous params and grads
        if not below_tol:
            p_prev.copy_(p)
            g_prev.copy_(g)

        # update effective preconditioning state
        if sy is not None:
            assert s is not None and y is not None and sy is not None

            s_history.append(s)
            y_history.append(y)

    def get_H(self, var=...):
        s_history = [tl.to_vec() for tl in self.global_state['s_history']]
        y_history = [tl.to_vec() for tl in self.global_state['y_history']]
        return LSR1LinearOperator(s_history, y_history)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        scale_first = self.defaults['scale_first']

        tensors = as_tensorlist(tensors)

        s_history = self.global_state['s_history']
        y_history = self.global_state['y_history']

        # precondition
        dir = lsr1_Hx(
            x=tensors,
            s_history=s_history,
            y_history=y_history,
        )

        # scale 1st step
        if scale_first and self.global_state.get('step', 1) == 1:
            dir *= initial_step_size(dir, eps=1e-7)

        return dir

McCormick

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

McCormick's Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class McCormick(_InverseHessianUpdateStrategyDefaults):
    """McCormicks's Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

        This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return mccormick_H_(H=H, s=s, y=y)

NewDQN

Bases: torchzero.modules.quasi_newton.diagonal_quasi_newton.DNRTR

Diagonal quasi-newton method.

Reference

Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class NewDQN(DNRTR):
    """Diagonal quasi-newton method.

    Reference:
        Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return new_dqn_B_(B=B, s=s, y=y)

NewSSM

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Self-scaling Quasi-Newton method.

Note

a line search such as tz.m.StrongWolfe() is required.

Warning

this uses roughly O(N^2) memory.

Reference

Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
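
Since a line search is required, a minimal usage sketch following the note above:

opt = tz.Modular(
    model.parameters(),
    tz.m.NewSSM(),
    tz.m.StrongWolfe()
)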

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class NewSSM(HessianUpdateStrategy):
    """Self-scaling Quasi-Newton method.

    Note:
        a line search such as ``tz.m.StrongWolfe()`` is required.

    Warning:
        this uses roughly O(N^2) memory.

    Reference:
        Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
    """
    def __init__(
        self,
        type: Literal[1, 2] = 1,
        init_scale: float | Literal["auto"] = "auto",
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        super().__init__(
            defaults=dict(type=type),
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        f = state['f']
        f_prev = state['f_prev']
        return new_ssm1(H=H, s=s, y=y, f=f, f_prev=f_prev, type=setting['type'], tol=setting['tol'])

PSB

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Powell's Symmetric Broyden Quasi-Newton method.

Note

a line search or a trust region is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class PSB(_HessianUpdateStrategyDefaults):
    """Powell's Symmetric Broyden Quasi-Newton method.

    Note:
        a line search or a trust region is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return psb_B_(B=B, s=s, y=y)

Pearson

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Pearson's Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Pearson(_InverseHessianUpdateStrategyDefaults):
    """
    Pearson's Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return pearson_H_(H=H, s=s, y=y)

ProjectedNewtonRaphson

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Projected Newton Raphson method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

This one is Algorithm 7.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ProjectedNewtonRaphson(HessianUpdateStrategy):
    """
    Projected Newton Raphson method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

        This one is Algorithm 7.
    """
    def __init__(
        self,
        init_scale: float | Literal["auto"] = 'auto',
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = 'auto',
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        super().__init__(
            init_scale=init_scale,
            tol=tol,
            ptol = ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        if 'R' not in state: state['R'] = torch.eye(H.size(-1), device=H.device, dtype=H.dtype)
        H, R = projected_newton_raphson_H_(H=H, R=state['R'], s=s, y=y)
        state["R"] = R
        return H

    def reset_P(self, P, s, y, inverse, init_scale, state):
        assert inverse
        if 'R' not in state: state['R'] = torch.eye(P.size(-1), device=P.device, dtype=P.dtype)
        P.copy_(state["R"])

SR1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Symmetric Rank 1. This works best with a trust region:

tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False))

Parameters:

  • init_scale (float | Literal['auto'], default: 'auto' ) –

    initial hessian matrix is set to identity times this.

    "auto" corresponds to a heuristic from [1] p.142-143.

    Defaults to "auto".

  • tol (float, default: 1e-32 ) –

    tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.

  • ptol (float | None, default: 1e-32 ) –

    skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.

  • restart_interval (int | None | Literal['auto'], default: None ) –

    interval between resetting the hessian approximation.

    "auto" corresponds to number of decision variables + 1.

    None - no resets.

    Defaults to None.

  • beta (float | None, default: None ) –

    momentum on H or B. Defaults to None.

  • update_freq (int, default: 1 ) –

    frequency of updating H or B. Defaults to 1.

  • scale_first (bool, default: False ) –

    whether to downscale the first step, before the Hessian approximation becomes available. Defaults to False.

  • scale_second (bool) –

    whether to downscale second step. Defaults to False.

  • concat_params (bool, default: True ) –

    If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.

  • inner (Chainable | None, default: None ) –

    preconditioning is applied to the output of this module. Defaults to None.

Examples:

SR1 with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False)),
)

References:
[1] Nocedal, J., & Wright, S. J. Numerical Optimization.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
class SR1(_InverseHessianUpdateStrategyDefaults):
    """Symmetric Rank 1. This works best with a trust region:
    ```python
    tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False))
    ```

    Args:
        init_scale (float | Literal["auto"], optional):
            initial hessian matrix is set to identity times this.

            "auto" corresponds to a heuristic from [1] p.142-143.

            Defaults to "auto".
        tol (float, optional):
            tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.
        ptol (float | None, optional):
            skips update if maximum difference between current and previous gradients is less than this, to avoid instability.
            Defaults to 1e-32.
        ptol_restart (bool, optional): whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
        restart_interval (int | None | Literal["auto"], optional):
            interval between resetting the hessian approximation.

            "auto" corresponds to number of decision variables + 1.

            None - no resets.

            Defaults to None.
        beta (float | None, optional): momentum on H or B. Defaults to None.
        update_freq (int, optional): frequency of updating H or B. Defaults to 1.
        scale_first (bool, optional):
            whether to downscale first step before hessian approximation becomes available. Defaults to True.
        scale_second (bool, optional): whether to downscale second step. Defaults to False.
        concat_params (bool, optional):
            If true, all parameters are treated as a single vector.
            If False, the update rule is applied to each parameter separately. Defaults to True.
        inner (Chainable | None, optional): preconditioning is applied to the output of this module. Defaults to None.

    ### Examples:

    SR1 with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False)),
    )
    ```

    ###  References:
        [1]. Nocedal. Stephen J. Wright. Numerical Optimization
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return sr1_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return sr1_(H=B, s=y, y=s, tol=setting['tol'])

SSVM

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Self-scaling variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
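
A minimal usage sketch pairing this module with a line search, following the pattern used throughout this page:

opt = tz.Modular(
    model.parameters(),
    tz.m.SSVM(),
    tz.m.StrongWolfe()
)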

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class SSVM(HessianUpdateStrategy):
    """
    Self-scaling variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
    """
    def __init__(
        self,
        switch: tuple[float,float] | Literal[1,2,3,4] = 3,
        init_scale: float | Literal["auto"] = 'auto',
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(switch=switch)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return ssvm_H_(H=H, s=s, y=y, g=g, switch=setting['switch'], tol=setting['tol'])

ShorR

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Shor’s r-algorithm.

Note

A line search such as tz.m.StrongWolfe(a_init="quadratic", fallback=True) is required. Like conjugate gradient, ShorR has no automatic step-size scaling, so setting a_init in the line search is recommended.

References

Shor, N. Z. (1985). Minimization Methods for Non-differentiable Functions. New York: Springer.

Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720. - good overview.

Ansari, Zafar A. Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998. - this is where a more efficient formula is described.
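
A minimal usage sketch following the note above (the line-search arguments are copied from that note):

opt = tz.Modular(
    model.parameters(),
    tz.m.ShorR(),
    tz.m.StrongWolfe(a_init="quadratic", fallback=True)
)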

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ShorR(HessianUpdateStrategy):
    """Shor’s r-algorithm.

    Note:
        A line search such as ``tz.m.StrongWolfe(a_init="quadratic", fallback=True)`` is required.
        Similarly to conjugate gradient, ShorR doesn't have an automatic step size scaling,
        so setting ``a_init`` in the line search is recommended.

    References:
        S HOR , N. Z. (1985) Minimization Methods for Non-differentiable Functions. New York: Springer.

        Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720. - good overview.

        Ansari, Zafar A. Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998. - this is where a more efficient formula is described.
    """

    def __init__(
        self,
        alpha=0.5,
        init_scale: float | Literal["auto"] = 1,
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        # inverse: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(alpha=alpha)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return shor_r_(H=H, y=y, alpha=setting['alpha'])

ThomasOptimalMethod

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Thomas's "optimal" Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ThomasOptimalMethod(_InverseHessianUpdateStrategyDefaults):
    """
    Thomas's "optimal" Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        if 'R' not in state: state['R'] = torch.eye(H.size(-1), device=H.device, dtype=H.dtype)
        H, state['R'] = thomas_H_(H=H, R=state['R'], s=s, y=y)
        return H

    def reset_P(self, P, s, y, inverse, init_scale, state):
        super().reset_P(P, s, y, inverse, init_scale, state)
        for st in self.state.values():
            st.pop("R", None)