Second order

This subpackage contains "True" second order methods that use exact second order information.

See also

  • Quasi-newton - quasi-newton methods that estimate the hessian using gradient information.
  • Higher-order - methods of third and higher order.

Classes:

  • InverseFreeNewton

    Inverse-free Newton's method.

  • Newton

    Exact Newton's method via autograd.

  • NewtonCG

    Newton's method with a matrix-free conjugate gradient or minimal-residual solver.

  • NewtonCGSteihaug

    Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.

  • NystromPCG

    Newton's method with a Nyström-preconditioned conjugate gradient solver.

  • NystromSketchAndSolve

    Newton's method with a Nyström sketch-and-solve solver.

  • SixthOrder3P

    Sixth-order iterative method.

  • SixthOrder3PM2

    Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.

  • SixthOrder5P

    Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.

  • TwoPointNewton

    Two-point Newton method with a frozen derivative and third-order convergence.

InverseFreeNewton

Bases: torchzero.core.module.Module

Inverse-free Newton's method.

.. note:: In most cases Newton should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

.. note:: This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating the hessian. The closure must accept a backward argument (refer to documentation).

.. warning:: This uses roughly O(N^2) memory.

Reference: Massalski, Marcin, and Magdalena Nockowska-Rosiak. "INVERSE-FREE NEWTON'S METHOD." Journal of Applied Analysis & Computation 15.4 (2025): 2238-2257.
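
A minimal usage sketch (not an example from the original docs; it assumes import torchzero as tz and an existing model, following the pattern of the other Newton-type examples on this page):

opt = tz.Modular(
    model.parameters(),
    tz.m.InverseFreeNewton(),
    tz.m.Backtracking()
)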

Source code in torchzero/modules/second_order/newton.py
class InverseFreeNewton(Module):
    """Inverse-free newton's method

    .. note::
        In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.

    .. note::
        This module requires a closure to be passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating the hessian.
        The closure must accept a ``backward`` argument (refer to documentation).

    .. warning::
        This uses roughly O(N^2) memory.

    Reference
        Massalski, Marcin, and Magdalena Nockowska-Rosiak. "INVERSE-FREE NEWTON'S METHOD." Journal of Applied Analysis & Computation 15.4 (2025): 2238-2257.
    """
    def __init__(
        self,
        update_freq: int = 1,
        hessian_method: Literal["autograd", "func", "autograd.functional"] = "autograd",
        vectorize: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(hessian_method=hessian_method, vectorize=vectorize, update_freq=update_freq)
        super().__init__(defaults)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def update(self, var):
        params = TensorList(var.params)
        closure = var.closure
        if closure is None: raise RuntimeError('InverseFreeNewton requires closure')

        settings = self.settings[params[0]]
        hessian_method = settings['hessian_method']
        vectorize = settings['vectorize']
        update_freq = settings['update_freq']

        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        g_list = var.grad
        Y = None
        if step % update_freq == 0:
            # ------------------------ calculate grad and hessian ------------------------ #
            if hessian_method == 'autograd':
                with torch.enable_grad():
                    loss = var.loss = var.loss_approx = closure(False)
                    g_list, H_list = jacobian_and_hessian_wrt([loss], params, batched=vectorize)
                    g_list = [t[0] for t in g_list] # remove leading dim from loss
                    var.grad = g_list
                    H = flatten_jacobian(H_list)

            elif hessian_method in ('func', 'autograd.functional'):
                strat = 'forward-mode' if vectorize else 'reverse-mode'
                with torch.enable_grad():
                    g_list = var.get_grad(retain_graph=True)
                    H = hessian_mat(partial(closure, backward=False), params,
                                    method=hessian_method, vectorize=vectorize, outer_jacobian_strategy=strat) # pyright:ignore[reportAssignmentType]

            else:
                raise ValueError(hessian_method)

            self.global_state["H"] = H

            # inverse free part
            if 'Y' not in self.global_state:
                num = H.T
                denom = (torch.linalg.norm(H, 1) * torch.linalg.norm(H, float('inf'))) # pylint:disable=not-callable
                finfo = torch.finfo(H.dtype)
                Y = self.global_state['Y'] = num.div_(denom.clip(min=finfo.tiny * 2, max=finfo.max / 2))

            else:
                Y = self.global_state['Y']
                I = torch.eye(Y.size(0), device=Y.device, dtype=Y.dtype).mul_(2)
                I -= H @ Y
                Y = self.global_state['Y'] = Y @ I


    def apply(self, var):
        Y = self.global_state["Y"]
        params = var.params

        # -------------------------------- inner step -------------------------------- #
        update = var.get_update()
        if 'inner' in self.children:
            update = apply_transform(self.children['inner'], update, params=params, grads=var.grad, var=var)

        g = torch.cat([t.ravel() for t in update])

        # ----------------------------------- solve ---------------------------------- #
        var.update = vec_to_tensors(Y@g, params)

        return var

    def get_H(self,var):
        return DenseWithInverse(A = self.global_state["H"], A_inv=self.global_state["Y"])

Newton

Bases: torchzero.core.module.Module

Exact Newton's method via autograd.

Newton's method produces a direction that jumps to the stationary point of the quadratic approximation of the target function. The update rule is (H + yI)⁻¹g, where H is the hessian, g is the gradient, and y is the damping parameter. g can be the output of another module if it is specified in the inner argument.

Note

In most cases Newton should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

Note

This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating the hessian. The closure must accept a backward argument (refer to documentation).

Parameters:

  • damping (float, default: 0 ) –

    Tikhonov regularizer value. Set this to 0 when using trust region. Defaults to 0.

  • search_negative ((bool, Optional), default: False ) –

    if True, whenever a negative eigenvalue is detected, the search direction is proposed along a weighted sum of the eigenvectors corresponding to negative eigenvalues.

  • use_lstsq ((bool, Optional), default: False ) –

    if True, least squares will be used to solve the linear system; this may generate reasonable directions when the hessian is not invertible. If False, Cholesky is tried first, then LU, and then least squares. If eigval_fn is specified, eigendecomposition will always be used to solve the linear system and this argument will be ignored.

  • hessian_method (str, default: 'autograd' ) –

    how to calculate the hessian. Defaults to "autograd".

  • vectorize (bool, default: True ) –

    whether to enable vectorized hessian. Defaults to True.

  • inner (Chainable | None, default: None ) –

    modules to apply hessian preconditioner to. Defaults to None.

  • H_tfm (Callable | None, default: None ) –

    optional hessian transform; takes two arguments - (hessian, gradient).

    It must return either a tuple (hessian, is_inverted), where is_inverted must be True if the transform inverted the hessian and False otherwise, or a single tensor which is used as the update.

    Defaults to None.

  • eigval_fn (Callable | None, default: None ) –

    optional eigenvalues transform, for example torch.abs or lambda L: torch.clip(L, min=1e-8). If this is specified, eigendecomposition will be used to invert the hessian.

See also

  • tz.m.NewtonCG: uses a matrix-free conjugate gradient solver and hessian-vector products, useful for large scale problems as it doesn't form the full hessian.
  • tz.m.NewtonCGSteihaug: trust region version of tz.m.NewtonCG.
  • tz.m.InverseFreeNewton: an inverse-free variant of Newton's method.
  • tz.m.quasi_newton: large collection of quasi-newton methods that estimate the hessian.

Notes

Implementation details

(H + yI)⁻¹g is calculated by solving the linear system (H + yI)x = g. The linear system is solved via Cholesky decomposition; if that fails, LU decomposition; and if that fails, least squares. Least squares can be forced by setting use_lstsq=True, which may generate better search directions when the linear system is overdetermined.

Additionally, if eigval_fn is specified or search_negative is True, an eigendecomposition of the hessian is computed, eigval_fn is applied to the eigenvalues, and (H + yI)⁻¹ is computed from the eigenvectors and transformed eigenvalues. This is generally more computationally expensive.

Handling non-convexity

Standard Newton's method does not handle non-convexity well without some modifications. This is because it jumps to the stationary point of the quadratic approximation, which may be a maximum.

The first modification to handle non-convexity is to modify the eigenvalues to be positive, for example by setting eigval_fn = lambda L: L.abs().clip(min=1e-4).

The second modification is search_negative=True, which searches along a negative curvature direction if one is detected. This also requires an eigendecomposition.

The Newton direction can also be forced to be a descent direction by using tz.m.GradSign() or tz.m.Cautious, but that may be significantly less efficient.
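
As an illustration of the eigenvalue modification described above, a hedged sketch (it assumes the same tz.Modular setup as the examples below):

opt = tz.Modular(
    model.parameters(),
    tz.m.Newton(eigval_fn=lambda L: L.abs().clip(min=1e-4)),
    tz.m.Backtracking()
)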

Examples:

Newton's method with backtracking line search

opt = tz.Modular(
    model.parameters(),
    tz.m.Newton(),
    tz.m.Backtracking()
)

Newton preconditioning applied to momentum

opt = tz.Modular(
    model.parameters(),
    tz.m.Newton(inner=tz.m.EMA(0.9)),
    tz.m.LR(0.1)
)

Diagonal Newton example. This still evaluates the entire hessian, so it isn't efficient, but it can be used to see how diagonal Newton behaves or compares to full Newton.

opt = tz.Modular(
    model.parameters(),
    tz.m.Newton(H_tfm = lambda H, g: g/H.diag()),
    tz.m.Backtracking()
)
Source code in torchzero/modules/second_order/newton.py
class Newton(Module):
    """Exact newton's method via autograd.

    Newton's method produces a direction jumping to the stationary point of quadratic approximation of the target function.
    The update rule is given by ``(H + yI)⁻¹g``, where ``H`` is the hessian and ``g`` is the gradient, ``y`` is the ``damping`` parameter.
    ``g`` can be the output of another module if it is specified in the ``inner`` argument.

    Note:
        In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.

    Note:
        This module requires a closure to be passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating the hessian.
        The closure must accept a ``backward`` argument (refer to documentation).

    Args:
        damping (float, optional): Tikhonov regularizer value. Set this to 0 when using trust region. Defaults to 0.
        search_negative (bool, Optional):
            if True, whenever a negative eigenvalue is detected,
            search direction is proposed along weighted sum of eigenvectors corresponding to negative eigenvalues.
        use_lstsq (bool, Optional):
            if True, least squares will be used to solve the linear system; this may generate reasonable directions
            when the hessian is not invertible. If False, Cholesky is tried first, then LU, and then least squares.
            If ``eigval_fn`` is specified, eigendecomposition will always be used to solve the linear system and this
            argument will be ignored.
        hessian_method (str):
            how to calculate hessian. Defaults to "autograd".
        vectorize (bool, optional):
            whether to enable vectorized hessian. Defaults to True.
        inner (Chainable | None, optional): modules to apply hessian preconditioner to. Defaults to None.
        H_tfm (Callable | None, optional):
            optional hessian transforms, takes in two arguments - `(hessian, gradient)`.

            must return either a tuple: `(hessian, is_inverted)` with transformed hessian and a boolean value
            which must be True if transform inverted the hessian and False otherwise.

            Or it returns a single tensor which is used as the update.

            Defaults to None.
        eigval_fn (Callable | None, optional):
            optional eigenvalues transform, for example ``torch.abs`` or ``lambda L: torch.clip(L, min=1e-8)``.
            If this is specified, eigendecomposition will be used to invert the hessian.

    # See also

    * ``tz.m.NewtonCG``: uses a matrix-free conjugate gradient solver and hessian-vector products,
    useful for large scale problems as it doesn't form the full hessian.
    * ``tz.m.NewtonCGSteihaug``: trust region version of ``tz.m.NewtonCG``.
    * ``tz.m.InverseFreeNewton``: an inverse-free variant of Newton's method.
    * ``tz.m.quasi_newton``: large collection of quasi-newton methods that estimate the hessian.

    # Notes

    ## Implementation details

    ``(H + yI)⁻¹g`` is calculated by solving the linear system ``(H + yI)x = g``.
    The linear system is solved via Cholesky decomposition; if that fails, LU decomposition; and if that fails, least squares.
    Least squares can be forced by setting ``use_lstsq=True``, which may generate better search directions when the linear system is overdetermined.

    Additionally, if ``eigval_fn`` is specified or ``search_negative`` is ``True``,
    eigendecomposition of the hessian is computed, ``eigval_fn`` is applied to the eigenvalues,
    and ``(H + yI)⁻¹`` is computed using the computed eigenvectors and transformed eigenvalues.
    This is generally more computationally expensive.

    ## Handling non-convexity

    Standard Newton's method does not handle non-convexity well without some modifications.
    This is because it jumps to the stationary point, which may be the maximum of the quadratic approximation.

    The first modification to handle non-convexity is to modify the eigenvalues to be positive,
    for example by setting ``eigval_fn = lambda L: L.abs().clip(min=1e-4)``.

    Second modification is ``search_negative=True``, which will search along a negative curvature direction if one is detected.
    This also requires an eigendecomposition.

    The Newton direction can also be forced to be a descent direction by using ``tz.m.GradSign()`` or ``tz.m.Cautious``,
    but that may be significantly less efficient.

    # Examples:

    Newton's method with backtracking line search

    ```py
    opt = tz.Modular(
        model.parameters(),
        tz.m.Newton(),
        tz.m.Backtracking()
    )
    ```

    Newton preconditioning applied to momentum

    ```py
    opt = tz.Modular(
        model.parameters(),
        tz.m.Newton(inner=tz.m.EMA(0.9)),
        tz.m.LR(0.1)
    )
    ```

    Diagonal newton example. This will still evaluate the entire hessian so it isn't efficient,
    but if you wanted to see how diagonal newton behaves or compares to full newton, you can use this.

    ```py
    opt = tz.Modular(
        model.parameters(),
        tz.m.Newton(H_tfm = lambda H, g: g/H.diag()),
        tz.m.Backtracking()
    )
    ```

    """
    def __init__(
        self,
        damping: float = 0,
        search_negative: bool = False,
        use_lstsq: bool = False,
        update_freq: int = 1,
        hessian_method: Literal["autograd", "func", "autograd.functional"] = "autograd",
        vectorize: bool = True,
        inner: Chainable | None = None,
        H_tfm: Callable[[torch.Tensor, torch.Tensor], tuple[torch.Tensor, bool]] | Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
        eigval_fn: Callable[[torch.Tensor], torch.Tensor] | None = None,
    ):
        defaults = dict(damping=damping, hessian_method=hessian_method, use_lstsq=use_lstsq, vectorize=vectorize, H_tfm=H_tfm, eigval_fn=eigval_fn, search_negative=search_negative, update_freq=update_freq)
        super().__init__(defaults)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def update(self, var):
        params = TensorList(var.params)
        closure = var.closure
        if closure is None: raise RuntimeError('Newton requires closure')

        settings = self.settings[params[0]]
        damping = settings['damping']
        hessian_method = settings['hessian_method']
        vectorize = settings['vectorize']
        update_freq = settings['update_freq']

        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        g_list = var.grad
        H = None
        if step % update_freq == 0:
            # ------------------------ calculate grad and hessian ------------------------ #
            if hessian_method == 'autograd':
                with torch.enable_grad():
                    loss = var.loss = var.loss_approx = closure(False)
                    g_list, H_list = jacobian_and_hessian_wrt([loss], params, batched=vectorize)
                    g_list = [t[0] for t in g_list] # remove leading dim from loss
                    var.grad = g_list
                    H = flatten_jacobian(H_list)

            elif hessian_method in ('func', 'autograd.functional'):
                strat = 'forward-mode' if vectorize else 'reverse-mode'
                with torch.enable_grad():
                    g_list = var.get_grad(retain_graph=True)
                    H = hessian_mat(partial(closure, backward=False), params,
                                    method=hessian_method, vectorize=vectorize, outer_jacobian_strategy=strat) # pyright:ignore[reportAssignmentType]

            else:
                raise ValueError(hessian_method)

            if damping != 0: H.add_(torch.eye(H.size(-1), dtype=H.dtype, device=H.device).mul_(damping))
            self.global_state['H'] = H

    @torch.no_grad
    def apply(self, var):
        H = self.global_state["H"]

        params = var.params
        settings = self.settings[params[0]]
        search_negative = settings['search_negative']
        H_tfm = settings['H_tfm']
        eigval_fn = settings['eigval_fn']
        use_lstsq = settings['use_lstsq']

        # -------------------------------- inner step -------------------------------- #
        update = var.get_update()
        if 'inner' in self.children:
            update = apply_transform(self.children['inner'], update, params=params, grads=var.grad, var=var)

        g = torch.cat([t.ravel() for t in update])

        # ----------------------------------- solve ---------------------------------- #
        update = None
        if H_tfm is not None:
            ret = H_tfm(H, g)

            if isinstance(ret, torch.Tensor):
                update = ret

            else: # returns (H, is_inv)
                H, is_inv = ret
                if is_inv: update = H @ g

        if search_negative or (eigval_fn is not None):
            update = _eigh_solve(H, g, eigval_fn, search_negative=search_negative)

        if update is None and use_lstsq: update = _least_squares_solve(H, g)
        if update is None: update = _cholesky_solve(H, g)
        if update is None: update = _lu_solve(H, g)
        if update is None: update = _least_squares_solve(H, g)

        var.update = vec_to_tensors(update, params)

        return var

    def get_H(self,var):
        H = self.global_state["H"]
        settings = self.defaults
        if settings['eigval_fn'] is not None:
            try:
                L, Q = torch.linalg.eigh(H) # pylint:disable=not-callable
                L = settings['eigval_fn'](L)
                H = Q @ L.diag_embed() @ Q.mH
                H_inv = Q @ L.reciprocal().diag_embed() @ Q.mH
                return DenseWithInverse(H, H_inv)

            except torch.linalg.LinAlgError:
                pass

        return Dense(H)

NewtonCG

Bases: torchzero.core.module.Module

Newton's method with a matrix-free conjugate gradient or minimal-residual solver.

Notes
  • In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

  • This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

Warning

CG may fail if hessian is not positive-definite.

Parameters:

  • maxiter (int | None, default: None ) –

    Maximum number of iterations for the conjugate gradient solver. By default, this is set to the number of dimensions in the objective function, which is the theoretical upper bound for CG convergence. Setting this to a smaller value (truncated Newton) can still generate good search directions. Defaults to None.

  • tol (float, default: 1e-08 ) –

    Relative tolerance for the conjugate gradient solver to determine convergence. Defaults to 1e-8.

  • reg (float, default: 1e-08 ) –

    Regularization parameter (damping) added to the Hessian diagonal. This helps ensure the system is positive-definite. Defaults to 1e-8.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • h (float, default: 0.001 ) –

    The step size for finite differences if hvp_method is "forward" or "central". Defaults to 1e-3.

  • warm_start (bool, default: False ) –

    If True, the conjugate gradient solver is initialized with the solution from the previous optimization step. This can accelerate convergence, especially in truncated Newton methods. Defaults to False.

  • inner (Chainable | None, default: None ) –

    NewtonCG will attempt to apply preconditioning to the output of this module.

Examples:

Newton-CG with a backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCG(),
    tz.m.Backtracking()
)

Truncated Newton method (useful for large-scale problems):

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCG(maxiter=10),
    tz.m.Backtracking()
)
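
A further sketch (an assumption built only from the parameters listed above, not an example from the original docs): forward-difference hessian-vector products avoid building a graph for the gradient, at the cost of one extra gradient evaluation per product.

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCG(hvp_method="forward", h=1e-3),
    tz.m.Backtracking()
)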

Source code in torchzero/modules/second_order/newton_cg.py
class NewtonCG(Module):
    """Newton's method with a matrix-free conjugate gradient or minimial-residual solver.

    Notes:
        * In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the ``inner`` argument if you wish to apply Newton preconditioning to another module's output.

        * This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument (refer to documentation).

    Warning:
        CG may fail if hessian is not positive-definite.

    Args:
        maxiter (int | None, optional):
            Maximum number of iterations for the conjugate gradient solver.
            By default, this is set to the number of dimensions in the
            objective function, which is the theoretical upper bound for CG
            convergence. Setting this to a smaller value (truncated Newton)
            can still generate good search directions. Defaults to None.
        tol (float, optional):
            Relative tolerance for the conjugate gradient solver to determine
            convergence. Defaults to 1e-8.
        reg (float, optional):
            Regularization parameter (damping) added to the Hessian diagonal.
            This helps ensure the system is positive-definite. Defaults to 1e-8.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        h (float, optional):
            The step size for finite differences if :code:`hvp_method` is
            ``"forward"`` or ``"central"``. Defaults to 1e-3.
        warm_start (bool, optional):
            If ``True``, the conjugate gradient solver is initialized with the
            solution from the previous optimization step. This can accelerate
            convergence, especially in truncated Newton methods.
            Defaults to False.
        inner (Chainable | None, optional):
            NewtonCG will attempt to apply preconditioning to the output of this module.

    Examples:
    Newton-CG with a backtracking line search:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCG(),
        tz.m.Backtracking()
    )
    ```

    Truncated Newton method (useful for large-scale problems):
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCG(maxiter=10),
        tz.m.Backtracking()
    )
    ```

    """
    def __init__(
        self,
        maxiter: int | None = None,
        tol: float = 1e-8,
        reg: float = 1e-8,
        hvp_method: Literal["forward", "central", "autograd"] = "autograd",
        solver: Literal['cg', 'minres', 'minres_npc'] = 'cg',
        h: float = 1e-3, # tuned 1e-4 or 1e-3
        miniter:int = 1,
        warm_start=False,
        inner: Chainable | None = None,
    ):
        defaults = locals().copy()
        del defaults['self'], defaults['inner']
        super().__init__(defaults,)

        if inner is not None:
            self.set_child('inner', inner)

        self._num_hvps = 0
        self._num_hvps_last_step = 0

    @torch.no_grad
    def step(self, var):
        params = TensorList(var.params)
        closure = var.closure
        if closure is None: raise RuntimeError('NewtonCG requires closure')

        settings = self.settings[params[0]]
        tol = settings['tol']
        reg = settings['reg']
        maxiter = settings['maxiter']
        hvp_method = settings['hvp_method']
        solver = settings['solver'].lower().strip()
        h = settings['h']
        warm_start = settings['warm_start']

        self._num_hvps_last_step = 0
        # ---------------------- Hessian vector product function --------------------- #
        if hvp_method == 'autograd':
            grad = var.get_grad(create_graph=True)

            def H_mm(x):
                self._num_hvps_last_step += 1
                with torch.enable_grad():
                    return TensorList(hvp(params, grad, x, retain_graph=True))

        else:

            with torch.enable_grad():
                grad = var.get_grad()

            if hvp_method == 'forward':
                def H_mm(x):
                    self._num_hvps_last_step += 1
                    return TensorList(hvp_fd_forward(closure, params, x, h=h, g_0=grad, normalize=True)[1])

            elif hvp_method == 'central':
                def H_mm(x):
                    self._num_hvps_last_step += 1
                    return TensorList(hvp_fd_central(closure, params, x, h=h, normalize=True)[1])

            else:
                raise ValueError(hvp_method)


        # -------------------------------- inner step -------------------------------- #
        b = var.get_update()
        if 'inner' in self.children:
            b = apply_transform(self.children['inner'], b, params=params, grads=grad, var=var)
        b = as_tensorlist(b)

        # ---------------------------------- run cg ---------------------------------- #
        x0 = None
        if warm_start: x0 = self.get_state(params, 'prev_x', cls=TensorList) # initialized to 0 which is default anyway

        if solver == 'cg':
            d, _ = cg(A_mm=H_mm, b=b, x0=x0, tol=tol, maxiter=maxiter, miniter=self.defaults["miniter"],reg=reg)

        elif solver == 'minres':
            d = minres(A_mm=H_mm, b=b, x0=x0, tol=tol, maxiter=maxiter, reg=reg, npc_terminate=False)

        elif solver == 'minres_npc':
            d = minres(A_mm=H_mm, b=b, x0=x0, tol=tol, maxiter=maxiter, reg=reg, npc_terminate=True)

        else:
            raise ValueError(f"Unknown solver {solver}")

        if warm_start:
            assert x0 is not None
            x0.copy_(d)

        var.update = d

        self._num_hvps += self._num_hvps_last_step
        return var

NewtonCGSteihaug

Bases: torchzero.core.module.Module

Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.

Notes
  • In most cases NewtonCGSteihaug should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

  • This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

Parameters:

  • eta (float, default: 0.0 ) –

    if the ratio of actual to predicted reduction is larger than this, the step is accepted. Defaults to 0.0.

  • nplus (float, default: 3.5 ) –

    increase factor on successful steps. Defaults to 3.5.

  • nminus (float, default: 0.25 ) –

    decrease factor on unsuccessful steps. Defaults to 0.25.

  • rho_good (float, default: 0.99 ) –

    if the ratio of actual to predicted reduction is larger than this, the trust region size is multiplied by nplus.

  • rho_bad (float, default: 0.0001 ) –

    if the ratio of actual to predicted reduction is less than this, the trust region size is multiplied by nminus.

  • init (float, default: 1 ) –

    Initial trust region value. Defaults to 1.

  • max_attempts (int, default: 100 ) –

    maximum number of trust radius reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 100.

  • max_history (int, default: 100 ) –

    CG will store this many intermediate solutions, reusing them when trust radius is reduced instead of re-running CG. Each solution storage requires 2N memory. Defaults to 100.

  • boundary_tol (float | None, default: 1e-06 ) –

    The trust region only increases when the suggested step's norm is at least (1-boundary_tol)*trust_region. This prevents increasing the trust region when the solution is not on the boundary. Defaults to 1e-6.

  • maxiter (int | None, default: None ) –

    maximum number of CG iterations per step. Each iteration requires one backward pass if hvp_method="forward", two otherwise. Defaults to None.

  • miniter (int, default: 1 ) –

    minimal number of CG iterations. This prevents making no progress when the initial guess is already below the tolerance. Defaults to 1.

  • tol (float, default: 1e-08 ) –

    terminates CG when the norm of the residual is less than this value. Defaults to 1e-8.

  • reg (float, default: 1e-08 ) –

    hessian regularization. Defaults to 1e-8.

  • solver (str, default: 'cg' ) –

    solver, "cg" or "minres". "cg" is recommended. Defaults to 'cg'.

  • adapt_tol (bool, default: True ) –

    if True, whenever trust radius collapses to smallest representable number, the tolerance is multiplied by 0.1. Defaults to True.

  • npc_terminate (bool, default: False ) –

    whether to terminate CG/MINRES whenever negative curvature is detected. Defaults to False.

  • hvp_method (str, default: 'central' ) –

    either "forward" to use forward formula which requires one backward pass per Hvp, or "central" to use a more accurate central formula which requires two backward passes. "forward" is usually accurate enough. Defaults to "forward".

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • inner (Chainable | None, default: None ) –

    applies preconditioning to output of this module. Defaults to None.

Examples:

Trust-region Newton-CG:

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCGSteihaug(),
)
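
A further sketch (an assumption using only the parameters documented above): a truncated trust-region variant with forward-difference HVPs.

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCGSteihaug(maxiter=50, hvp_method="forward", init=1.0),
)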
Reference:
Steihaug, Trond. "The conjugate gradient method and trust regions in large scale optimization." SIAM Journal on Numerical Analysis 20.3 (1983): 626-637.
Source code in torchzero/modules/second_order/newton_cg.py
class NewtonCGSteihaug(Module):
    """Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.

    Notes:
        * In most cases NewtonCGSteihaug should be the first module in the chain because it relies on autograd. Use the ``inner`` argument if you wish to apply Newton preconditioning to another module's output.

        * This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument (refer to documentation).

    Args:
        eta (float, optional):
            if ratio of actual to predicted reduction is larger than this, step is accepted. Defaults to 0.0.
        nplus (float, optional): increase factor on successful steps. Defaults to 3.5.
        nminus (float, optional): decrease factor on unsuccessful steps. Defaults to 0.25.
        rho_good (float, optional):
            if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by `nplus`.
        rho_bad (float, optional):
            if ratio of actual to predicted reduction is less than this, trust region size is multiplied by `nminus`.
        init (float, optional): Initial trust region value. Defaults to 1.
        max_attempts (int, optional):
            maximum number of trust radius reductions per step. A zero update vector is returned when
            this limit is exceeded. Defaults to 100.
        max_history (int, optional):
            CG will store this many intermediate solutions, reusing them when trust radius is reduced
            instead of re-running CG. Each solution storage requires 2N memory. Defaults to 100.
        boundary_tol (float | None, optional):
            The trust region only increases when suggested step's norm is at least `(1-boundary_tol)*trust_region`.
            This prevents increasing trust region when solution is not on the boundary. Defaults to 1e-6.

        maxiter (int | None, optional):
            maximum number of CG iterations per step. Each iteration requires one backward pass if `hvp_method="forward"`, two otherwise. Defaults to None.
        miniter (int, optional):
            minimal number of CG iterations. This prevents making no progress when the initial guess is already below the tolerance. Defaults to 1.
        tol (float, optional):
            terminates CG when norm of the residual is less than this value. Defaults to 1e-8.
        reg (float, optional): hessian regularization. Defaults to 1e-8.
        solver (str, optional): solver, "cg" or "minres". "cg" is recommended. Defaults to 'cg'.
        adapt_tol (bool, optional):
            if True, whenever trust radius collapses to smallest representable number,
            the tolerance is multiplied by 0.1. Defaults to True.
        npc_terminate (bool, optional):
            whether to terminate CG/MINRES whenever negative curvature is detected. Defaults to False.

        hvp_method (str, optional):
            either "forward" to use forward formula which requires one backward pass per Hvp, or "central" to use a more accurate central formula which requires two backward passes. "forward" is usually accurate enough. Defaults to "forward".
        h (float, optional): finite difference step size. Defaults to 1e-3.

        inner (Chainable | None, optional):
            applies preconditioning to output of this module. Defaults to None.

    ### Examples:
    Trust-region Newton-CG:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCGSteihaug(),
    )
    ```

    ### Reference:
        Steihaug, Trond. "The conjugate gradient method and trust regions in large scale optimization." SIAM Journal on Numerical Analysis 20.3 (1983): 626-637.
    """
    def __init__(
        self,
        # trust region settings
        eta: float= 0.0,
        nplus: float = 3.5,
        nminus: float = 0.25,
        rho_good: float = 0.99,
        rho_bad: float = 1e-4,
        init: float = 1,
        max_attempts: int = 100,
        max_history: int = 100,
        boundary_tol: float = 1e-6, # tuned

        # cg settings
        maxiter: int | None = None,
        miniter: int = 1,
        tol: float = 1e-8,
        reg: float = 1e-8,
        solver: Literal['cg', "minres"] = 'cg',
        adapt_tol: bool = True,
        npc_terminate: bool = False,

        # hvp settings
        hvp_method: Literal["forward", "central"] = "central",
        h: float = 1e-3, # tuned 1e-4 or 1e-3

        # inner
        inner: Chainable | None = None,
    ):
        defaults = locals().copy()
        del defaults['self'], defaults['inner']
        super().__init__(defaults,)

        if inner is not None:
            self.set_child('inner', inner)

        self._num_hvps = 0
        self._num_hvps_last_step = 0

    @torch.no_grad
    def step(self, var):
        params = TensorList(var.params)
        closure = var.closure
        if closure is None: raise RuntimeError('NewtonCGSteihaug requires closure')

        tol = self.defaults['tol'] * self.global_state.get('tol_mul', 1)
        solver = self.defaults['solver'].lower().strip()

        (reg, maxiter, hvp_method, h, max_attempts, boundary_tol,
         eta, nplus, nminus, rho_good, rho_bad, init, npc_terminate,
         miniter, max_history, adapt_tol) = itemgetter(
             "reg", "maxiter", "hvp_method", "h", "max_attempts", "boundary_tol",
             "eta", "nplus", "nminus", "rho_good", "rho_bad", "init", "npc_terminate",
             "miniter", "max_history", "adapt_tol",
        )(self.defaults)

        self._num_hvps_last_step = 0

        # ---------------------- Hessian vector product function --------------------- #
        if hvp_method == 'autograd':
            grad = var.get_grad(create_graph=True)

            def H_mm(x):
                self._num_hvps_last_step += 1
                with torch.enable_grad():
                    return TensorList(hvp(params, grad, x, retain_graph=True))

        else:

            with torch.enable_grad():
                grad = var.get_grad()

            if hvp_method == 'forward':
                def H_mm(x):
                    self._num_hvps_last_step += 1
                    return TensorList(hvp_fd_forward(closure, params, x, h=h, g_0=grad, normalize=True)[1])

            elif hvp_method == 'central':
                def H_mm(x):
                    self._num_hvps_last_step += 1
                    return TensorList(hvp_fd_central(closure, params, x, h=h, normalize=True)[1])

            else:
                raise ValueError(hvp_method)


        # -------------------------------- inner step -------------------------------- #
        b = var.get_update()
        if 'inner' in self.children:
            b = apply_transform(self.children['inner'], b, params=params, grads=grad, var=var)
        b = as_tensorlist(b)

        # ------------------------------- trust region ------------------------------- #
        success = False
        d = None
        x0 = [p.clone() for p in params]
        solution = None

        while not success:
            max_attempts -= 1
            if max_attempts < 0: break

            trust_radius = self.global_state.get('trust_radius', init)

            # -------------- make sure trust radius isn't too small or large ------------- #
            finfo = torch.finfo(x0[0].dtype)
            if trust_radius < finfo.tiny * 2:
                trust_radius = self.global_state['trust_radius'] = init
                if adapt_tol:
                    self.global_state["tol_mul"] = self.global_state.get("tol_mul", 1) * 0.1

            elif trust_radius > finfo.max / 2:
                trust_radius = self.global_state['trust_radius'] = init

            # ----------------------------------- solve ---------------------------------- #
            d = None
            if solution is not None and solution.history is not None:
                d = find_within_trust_radius(solution.history, trust_radius)

            if d is None:
                if solver == 'cg':
                    d, solution = cg(
                        A_mm=H_mm,
                        b=b,
                        tol=tol,
                        maxiter=maxiter,
                        reg=reg,
                        trust_radius=trust_radius,
                        miniter=miniter,
                        npc_terminate=npc_terminate,
                        history_size=max_history,
                    )

                elif solver == 'minres':
                    d = minres(A_mm=H_mm, b=b, trust_radius=trust_radius, tol=tol, maxiter=maxiter, reg=reg, npc_terminate=npc_terminate)

                else:
                    raise ValueError(f"unknown solver {solver}")

            # ---------------------------- update trust radius --------------------------- #
            self.global_state["trust_radius"], success = default_radius(
                params=params,
                closure=closure,
                f=tofloat(var.get_loss(False)),
                g=b,
                H=H_mm,
                d=d,
                trust_radius=trust_radius,
                eta=eta,
                nplus=nplus,
                nminus=nminus,
                rho_good=rho_good,
                rho_bad=rho_bad,
                boundary_tol=boundary_tol,

                init=init, # init isn't used because check_overflow=False
                state=self.global_state, # not used
                settings=self.defaults, # not used
                check_overflow=False, # this is checked manually to adapt tolerance
            )

        # --------------------------- assign new direction --------------------------- #
        assert d is not None
        if success:
            var.update = d

        else:
            var.update = params.zeros_like()

        self._num_hvps += self._num_hvps_last_step
        return var

NystromPCG

Bases: torchzero.core.module.Module

Newton's method with a Nyström-preconditioned conjugate gradient solver. This tends to outperform NewtonCG but requires tuning the sketch size. An adaptive version exists in https://arxiv.org/abs/2110.02820; I might implement it too at some point.

.. note:: This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

.. note:: In most cases NystromPCG should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

Parameters:

  • sketch_size (int) –

    size of the sketch for preconditioning; this many hessian-vector products will be evaluated before running the conjugate gradient solver. A larger value improves the preconditioning and speeds up conjugate gradient.

  • maxiter (int | None, default: None ) –

    maximum number of iterations. By default this is set to the number of dimensions in the objective function, which is supposed to be enough for conjugate gradient to have guaranteed convergence. Setting this to a small value can still generate good enough directions. Defaults to None.

  • tol (float, default: 0.001 ) –

    relative tolerance for the conjugate gradient solver. Defaults to 1e-3.

  • reg (float, default: 1e-06 ) –

    regularization parameter. Defaults to 1e-6.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • h (float, default: 0.001 ) –

    finite difference step size if hvp_method is "forward" or "central". Defaults to 1e-3.

  • inner (Chainable | None, default: None ) –

    modules to apply hessian preconditioner to. Defaults to None.

  • seed (int | None, default: None ) –

    seed for random generator. Defaults to None.

Examples:

NystromPCG with backtracking line search

opt = tz.Modular(
    model.parameters(),
    tz.m.NystromPCG(10),
    tz.m.Backtracking()
)
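
A further sketch (an assumption using only the parameters documented above): a larger sketch size with finite-difference HVPs.

opt = tz.Modular(
    model.parameters(),
    tz.m.NystromPCG(sketch_size=100, hvp_method="forward"),
    tz.m.Backtracking()
)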
Reference

Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820

Source code in torchzero/modules/second_order/nystrom.py
class NystromPCG(Module):
    """Newton's method with a Nyström-preconditioned conjugate gradient solver.
    This tends to outperform NewtonCG but requires tuning sketch size.
    An adaptive version exists in https://arxiv.org/abs/2110.02820, I might implement it too at some point.

    .. note::
        This module requires a closure to be passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating HVPs.
        The closure must accept a ``backward`` argument (refer to documentation).

    .. note::
        In most cases NystromPCG should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.

    Args:
        sketch_size (int):
            size of the sketch for preconditioning, this many hessian-vector products will be evaluated before
            running the conjugate gradient solver. Larger value improves the preconditioning and speeds up
            conjugate gradient.
        maxiter (int | None, optional):
            maximum number of iterations. By default this is set to the number of dimensions
            in the objective function, which is supposed to be enough for conjugate gradient
            to have guaranteed convergence. Setting this to a small value can still generate good enough directions.
            Defaults to None.
        tol (float, optional): relative tolerance for conjugate gradient solver. Defaults to 1e-3.
        reg (float, optional): regularization parameter. Defaults to 1e-6.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        h (float, optional): finite difference step size if :code:`hvp_method` is "forward" or "central". Defaults to 1e-3.
        inner (Chainable | None, optional): modules to apply hessian preconditioner to. Defaults to None.
        seed (int | None, optional): seed for random generator. Defaults to None.

    Examples:

        NystromPCG with backtracking line search

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.NystromPCG(10),
                tz.m.Backtracking()
            )

    Reference:
        Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820

    """
    def __init__(
        self,
        sketch_size: int,
        maxiter=None,
        tol=1e-3,
        reg: float = 1e-6,
        hvp_method: Literal["forward", "central", "autograd"] = "autograd",
        h=1e-3,
        inner: Chainable | None = None,
        seed: int | None = None,
    ):
        defaults = dict(sketch_size=sketch_size, reg=reg, maxiter=maxiter, tol=tol, hvp_method=hvp_method, h=h, seed=seed)
        super().__init__(defaults,)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def step(self, var):
        params = TensorList(var.params)

        closure = var.closure
        if closure is None: raise RuntimeError('NystromPCG requires closure')

        settings = self.settings[params[0]]
        sketch_size = settings['sketch_size']
        maxiter = settings['maxiter']
        tol = settings['tol']
        reg = settings['reg']
        hvp_method = settings['hvp_method']
        h = settings['h']


        seed = settings['seed']
        generator = None
        if seed is not None:
            if 'generator' not in self.global_state:
                self.global_state['generator'] = torch.Generator(params[0].device).manual_seed(seed)
            generator = self.global_state['generator']


        # ---------------------- Hessian vector product function --------------------- #
        if hvp_method == 'autograd':
            grad = var.get_grad(create_graph=True)

            def H_mm(x):
                with torch.enable_grad():
                    Hvp = hvp(params, grad, params.from_vec(x), retain_graph=True)
                    return torch.cat([t.ravel() for t in Hvp])

        else:

            with torch.enable_grad():
                grad = var.get_grad()

            if hvp_method == 'forward':
                def H_mm(x):
                    Hvp = hvp_fd_forward(closure, params, params.from_vec(x), h=h, g_0=grad, normalize=True)[1]
                    return torch.cat([t.ravel() for t in Hvp])

            elif hvp_method == 'central':
                def H_mm(x):
                    Hvp = hvp_fd_central(closure, params, params.from_vec(x), h=h, normalize=True)[1]
                    return torch.cat([t.ravel() for t in Hvp])

            else:
                raise ValueError(hvp_method)


        # -------------------------------- inner step -------------------------------- #
        b = var.get_update()
        if 'inner' in self.children:
            b = apply_transform(self.children['inner'], b, params=params, grads=grad, var=var)

        # ------------------------------ sketch&n&solve ------------------------------ #
        x = nystrom_pcg(A_mm=H_mm, b=torch.cat([t.ravel() for t in b]), sketch_size=sketch_size, reg=reg, tol=tol, maxiter=maxiter, x0_=None, generator=generator)
        var.update = vec_to_tensors(x, reference=params)
        return var

NystromSketchAndSolve

Bases: torchzero.core.module.Module

Newton's method with a Nyström sketch-and-solve solver.

.. note:: This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

.. note:: In most cases NystromSketchAndSolve should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.

.. note:: If this is unstable, increase the reg parameter and tune the rank.

.. note:: tz.m.NystromPCG usually outperforms this.

Parameters:

  • rank (int) –

    size of the sketch, this many hessian-vector products will be evaluated per step.

  • reg (float, default: 0.001 ) –

    regularization parameter. Defaults to 1e-3.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • h (float, default: 0.001 ) –

    finite difference step size if hvp_method is "forward" or "central". Defaults to 1e-3.

  • inner (Chainable | None, default: None ) –

    modules to apply hessian preconditioner to. Defaults to None.

  • seed (int | None, default: None ) –

    seed for random generator. Defaults to None.

Examples:

NystromSketchAndSolve with backtracking line search

opt = tz.Modular(
    model.parameters(),
    tz.m.NystromSketchAndSolve(10),
    tz.m.Backtracking()
)
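
If the defaults are unstable, the note above suggests increasing reg and tuning the rank; a hedged sketch using only documented parameters:

opt = tz.Modular(
    model.parameters(),
    tz.m.NystromSketchAndSolve(rank=50, reg=1e-2),
    tz.m.Backtracking()
)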
Reference

Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820

Source code in torchzero/modules/second_order/nystrom.py
class NystromSketchAndSolve(Module):
    """Newton's method with a Nyström sketch-and-solve solver.

    .. note::
        This module requires a closure to be passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating HVPs.
        The closure must accept a ``backward`` argument (refer to documentation).

    .. note::
        In most cases NystromSketchAndSolve should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.

    .. note::
        If this is unstable, increase the :code:`reg` parameter and tune the rank.

    .. note::
        :code:`tz.m.NystromPCG` usually outperforms this.

    Args:
        rank (int): size of the sketch, this many hessian-vector products will be evaluated per step.
        reg (float, optional): regularization parameter. Defaults to 1e-3.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        h (float, optional): finite difference step size if :code:`hvp_method` is "forward" or "central". Defaults to 1e-3.
        inner (Chainable | None, optional): modules to apply hessian preconditioner to. Defaults to None.
        seed (int | None, optional): seed for random generator. Defaults to None.

    Examples:
        NystromSketchAndSolve with backtracking line search

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.NystromSketchAndSolve(10),
                tz.m.Backtracking()
            )

    Reference:
        Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
    """
    def __init__(
        self,
        rank: int,
        reg: float = 1e-3,
        hvp_method: Literal["forward", "central", "autograd"] = "autograd",
        h: float = 1e-3,
        inner: Chainable | None = None,
        seed: int | None = None,
    ):
        defaults = dict(rank=rank, reg=reg, hvp_method=hvp_method, h=h, seed=seed)
        super().__init__(defaults,)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def step(self, var):
        params = TensorList(var.params)

        closure = var.closure
        if closure is None: raise RuntimeError('NystromSketchAndSolve requires closure')

        settings = self.settings[params[0]]
        rank = settings['rank']
        reg = settings['reg']
        hvp_method = settings['hvp_method']
        h = settings['h']

        seed = settings['seed']
        generator = None
        if seed is not None:
            if 'generator' not in self.global_state:
                self.global_state['generator'] = torch.Generator(params[0].device).manual_seed(seed)
            generator = self.global_state['generator']

        # ---------------------- Hessian vector product function --------------------- #
        if hvp_method == 'autograd':
            grad = var.get_grad(create_graph=True)

            def H_mm(x):
                with torch.enable_grad():
                    Hvp = hvp(params, grad, params.from_vec(x), retain_graph=True)
                    return torch.cat([t.ravel() for t in Hvp])

        else:

            with torch.enable_grad():
                grad = var.get_grad()

            if hvp_method == 'forward':
                def H_mm(x):
                    Hvp = hvp_fd_forward(closure, params, params.from_vec(x), h=h, g_0=grad, normalize=True)[1]
                    return torch.cat([t.ravel() for t in Hvp])

            elif hvp_method == 'central':
                def H_mm(x):
                    Hvp = hvp_fd_central(closure, params, params.from_vec(x), h=h, normalize=True)[1]
                    return torch.cat([t.ravel() for t in Hvp])

            else:
                raise ValueError(hvp_method)


        # -------------------------------- inner step -------------------------------- #
        b = var.get_update()
        if 'inner' in self.children:
            b = apply_transform(self.children['inner'], b, params=params, grads=grad, var=var)

        # ------------------------------ sketch&n&solve ------------------------------ #
        x = nystrom_sketch_and_solve(A_mm=H_mm, b=torch.cat([t.ravel() for t in b]), rank=rank, reg=reg, generator=generator)
        var.update = vec_to_tensors(x, reference=params)
        return var

SixthOrder3P

Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase

Sixth-order iterative method.

Abro, Hameer Akhtar, and Muhammad Mujtaba Shaikh. "A new time-efficient and convergent nonlinear solver." Applied Mathematics and Computation 355 (2019): 516-536.
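
No usage example is given in the original docs; a minimal hedged sketch, assuming the same modular API as the Newton examples above:

opt = tz.Modular(
    model.parameters(),
    tz.m.SixthOrder3P(),
    tz.m.Backtracking()
)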

Source code in torchzero/modules/second_order/multipoint.py
class SixthOrder3P(HigherOrderMethodBase):
    """Sixth-order iterative method.

    Abro, Hameer Akhtar, and Muhammad Mujtaba Shaikh. "A new time-efficient and convergent nonlinear solver." Applied Mathematics and Computation 355 (2019): 516-536.
    """
    def __init__(self, lstsq: bool=False, vectorize: bool = True):
        defaults=dict(lstsq=lstsq)
        super().__init__(defaults=defaults, vectorize=vectorize)

    def one_iteration(self, x, evaluate, var):
        settings = self.defaults
        lstsq = settings['lstsq']
        def f(x): return evaluate(x, 1)[1]
        def f_j(x): return evaluate(x, 2)[1:]
        x_star = sixth_order_3p(x, f, f_j, lstsq)
        return x - x_star

SixthOrder3PM2

Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase

Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.

Source code in torchzero/modules/second_order/multipoint.py
class SixthOrder3PM2(HigherOrderMethodBase):
    """Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45."""
    def __init__(self, lstsq: bool=False, vectorize: bool = True):
        defaults=dict(lstsq=lstsq)
        super().__init__(defaults=defaults, vectorize=vectorize)

    def one_iteration(self, x, evaluate, var):
        settings = self.defaults
        lstsq = settings['lstsq']
        def f_j(x): return evaluate(x, 2)[1:]
        def f(x): return evaluate(x, 1)[1]
        x_star = sixth_order_3pm2(x, f, f_j, lstsq)
        return x - x_star

SixthOrder5P

Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase

Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.

Source code in torchzero/modules/second_order/multipoint.py
class SixthOrder5P(HigherOrderMethodBase):
    """Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139."""
    def __init__(self, lstsq: bool=False, vectorize: bool = True):
        defaults=dict(lstsq=lstsq)
        super().__init__(defaults=defaults, vectorize=vectorize)

    def one_iteration(self, x, evaluate, var):
        settings = self.defaults
        lstsq = settings['lstsq']
        def f_j(x): return evaluate(x, 2)[1:]
        x_star = sixth_order_5p(x, f_j, lstsq)
        return x - x_star

TwoPointNewton

Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase

Two-point Newton method with a frozen derivative and third-order convergence.

Sharma, Janak Raj, and Deepak Kumar. "A fast and efficient composite Newton–Chebyshev method for systems of nonlinear equations." Journal of Complexity 49 (2018): 56-73.
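
No usage example is given in the original docs; a minimal hedged sketch (the lstsq flag is a documented constructor argument, presumably switching the linear solves to least squares):

opt = tz.Modular(
    model.parameters(),
    tz.m.TwoPointNewton(lstsq=True),
    tz.m.Backtracking()
)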

Source code in torchzero/modules/second_order/multipoint.py
class TwoPointNewton(HigherOrderMethodBase):
    """two-point Newton method with frozen derivative with third order convergence.

    Sharma, Janak Raj, and Deepak Kumar. "A fast and efficient composite Newton–Chebyshev method for systems of nonlinear equations." Journal of Complexity 49 (2018): 56-73."""
    def __init__(self, lstsq: bool=False, vectorize: bool = True):
        defaults=dict(lstsq=lstsq)
        super().__init__(defaults=defaults, vectorize=vectorize)

    def one_iteration(self, x, evaluate, var):
        settings = self.defaults
        lstsq = settings['lstsq']
        def f(x): return evaluate(x, 1)[1]
        def f_j(x): return evaluate(x, 2)[1:]
        x_star = two_point_newton(x, f, f_j, lstsq)
        return x - x_star