Second order¶
This subpackage contains "true" second-order methods that use exact second-order information.
See also¶
- Quasi-newton - quasi-newton methods that estimate the hessian using gradient information.
- Higher-order - methods of third and higher order.
Classes:

- `InverseFreeNewton` – Inverse-free Newton's method.
- `Newton` – Exact Newton's method via autograd.
- `NewtonCG` – Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
- `NewtonCGSteihaug` – Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
- `NystromPCG` – Newton's method with a Nyström-preconditioned conjugate gradient solver.
- `NystromSketchAndSolve` – Newton's method with a Nyström sketch-and-solve solver.
- `SixthOrder3P` – Sixth-order iterative method.
- `SixthOrder3PM2` – Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
- `SixthOrder5P` – Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
- `TwoPointNewton` – Two-point Newton method with frozen derivative with third-order convergence.
InverseFreeNewton ¶
Bases: torchzero.core.module.Module
Inverse-free Newton's method.
.. note::
    In most cases Newton should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.
.. note::
    This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating the hessian. The closure must accept a `backward` argument (refer to documentation).
.. warning:: This module uses roughly O(N^2) memory.
Reference: Massalski, Marcin, and Magdalena Nockowska-Rosiak. "Inverse-free Newton's method." Journal of Applied Analysis & Computation 15.4 (2025): 2238-2257.
Source code in torchzero/modules/second_order/newton.py
Newton ¶
Bases: torchzero.core.module.Module
Exact Newton's method via autograd.
Newton's method produces a direction by jumping to the stationary point of the quadratic approximation of the target function. The update rule is given by `(H + yI)⁻¹g`, where `H` is the hessian, `g` is the gradient, and `y` is the `damping` parameter. `g` can be the output of another module if it is specified in the `inner` argument.
Note

In most cases Newton should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.

Note

This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating the hessian. The closure must accept a `backward` argument (refer to documentation).
Parameters:

- `damping` (float, default: 0) – Tikhonov regularizer value. Set this to 0 when using trust region. Defaults to 0.
- `search_negative` (bool, optional, default: False) – if True, whenever a negative eigenvalue is detected, the search direction is proposed along a weighted sum of the eigenvectors corresponding to negative eigenvalues.
- `use_lstsq` (bool, optional, default: False) – if True, least squares will be used to solve the linear system, which may generate reasonable directions when the hessian is not invertible. If False, Cholesky is tried first, then LU if that fails, and finally least squares. If `eigval_fn` is specified, eigendecomposition will always be used to solve the linear system and this argument is ignored.
- `hessian_method` (str, default: "autograd") – how to calculate the hessian. Defaults to "autograd".
- `vectorize` (bool, default: True) – whether to enable vectorized hessian computation. Defaults to True.
- `inner` (Chainable | None, default: None) – modules to apply the hessian preconditioner to. Defaults to None.
- `H_tfm` (Callable | None, default: None) – optional hessian transform; takes two arguments, `(hessian, gradient)`. It must return either a tuple `(hessian, is_inverted)`, where the boolean must be True if the transform inverted the hessian and False otherwise, or a single tensor which is used as the update. Defaults to None.
- `eigval_fn` (Callable | None, default: None) – optional eigenvalue transform, for example `torch.abs` or `lambda L: torch.clip(L, min=1e-8)`. If this is specified, eigendecomposition will be used to invert the hessian.
See also¶
- `tz.m.NewtonCG`: uses a matrix-free conjugate gradient solver and hessian-vector products, useful for large-scale problems as it doesn't form the full hessian.
- `tz.m.NewtonCGSteihaug`: trust region version of `tz.m.NewtonCG`.
- `tz.m.InverseFreeNewton`: an inverse-free variant of Newton's method.
- `tz.m.quasi_newton`: a large collection of quasi-newton methods that estimate the hessian.
Notes¶
Implementation details¶
`(H + yI)⁻¹g` is calculated by solving the linear system `(H + yI)x = g`. The linear system is solved via Cholesky decomposition; if that fails, LU decomposition; and if that fails, least squares. Least squares can be forced by setting `use_lstsq=True`, which may generate better search directions when the linear system is overdetermined.
Additionally, if `eigval_fn` is specified or `search_negative` is True, an eigendecomposition of the hessian is computed, `eigval_fn` is applied to the eigenvalues, and `(H + yI)⁻¹` is computed from the eigenvectors and transformed eigenvalues. This is generally more computationally expensive.
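To make the fallback chain concrete, here is a minimal, self-contained sketch of solving `(H + yI)x = g` with Cholesky, LU, and least-squares fallbacks. This illustrates the strategy described above rather than the module's actual code; `solve_newton_step` is a hypothetical helper name.
.. code-block:: python

    import torch

    def solve_newton_step(H: torch.Tensor, g: torch.Tensor, damping: float = 0.0) -> torch.Tensor:
        """Solve (H + damping*I) x = g, falling back when a factorization fails."""
        A = H + damping * torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
        try:
            # fast path: valid when A is symmetric positive-definite
            L = torch.linalg.cholesky(A)
            return torch.cholesky_solve(g.unsqueeze(1), L).squeeze(1)
        except torch.linalg.LinAlgError:
            pass
        try:
            # general square solve (LU-based)
            return torch.linalg.solve(A, g)
        except torch.linalg.LinAlgError:
            # last resort: least squares also handles singular A
            return torch.linalg.lstsq(A, g.unsqueeze(1)).solution.squeeze(1)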
Handling non-convexity¶
Standard Newton's method does not handle non-convexity well without some modifications, because it jumps to the stationary point of the quadratic approximation, which may be its maximum.
The first modification to handle non-convexity is to modify the eigenvalues to be positive, for example by setting `eigval_fn = lambda L: L.abs().clip(min=1e-4)`.
The second modification is `search_negative=True`, which will search along a negative curvature direction if one is detected. This also requires an eigendecomposition.
The Newton direction can also be forced to be a descent direction by using `tz.m.GradSign()` or `tz.m.Cautious`, but that may be significantly less efficient.
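For instance, a convexified Newton step can be configured through the `eigval_fn` argument documented above; a minimal sketch following the same pattern as the examples below:
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # make the quadratic model convex by flipping and clipping eigenvalues
        tz.m.Newton(eigval_fn=lambda L: L.abs().clip(min=1e-4)),
        tz.m.Backtracking()
    )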
Examples:¶
Newton's method with backtracking line search
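A minimal sketch of this setup:
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Newton(),
        tz.m.Backtracking()
    )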
Newton preconditioning applied to momentum
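A sketch using the `inner` argument; the momentum module name (`tz.m.EMA`) is an assumption and may differ in the actual library:
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # hypothetical momentum module; Newton preconditioning is applied to its output
        tz.m.Newton(inner=tz.m.EMA(0.9)),
        tz.m.Backtracking()
    )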
Diagonal newton example. This will still evaluate the entire hessian so it isn't efficient, but if you wanted to see how diagonal newton behaves or compares to full newton, you can use this.
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Newton(H_tfm = lambda H, g: g/H.diag()),
        tz.m.Backtracking()
    )
Source code in torchzero/modules/second_order/newton.py
NewtonCG ¶
Bases: torchzero.core.module.Module
Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
Notes

- In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a `backward` argument (refer to documentation).
Warning
CG may fail if hessian is not positive-definite.
Parameters:
- `maxiter` (int | None, default: None) – Maximum number of iterations for the conjugate gradient solver. By default, this is set to the number of dimensions in the objective function, which is the theoretical upper bound for CG convergence. Setting this to a smaller value (truncated Newton) can still generate good search directions. Defaults to None.
- `tol` (float, default: 1e-8) – Relative tolerance for the conjugate gradient solver to determine convergence. Defaults to 1e-8.
- `reg` (float, default: 1e-8) – Regularization parameter (damping) added to the Hessian diagonal. This helps ensure the system is positive-definite. Defaults to 1e-8.
- `hvp_method` (str, default: "autograd") – Determines how Hessian-vector products are evaluated. Defaults to "autograd".
    - `"autograd"`: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    - `"forward"`: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    - `"central"`: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations.
- `h` (float, default: 1e-3) – The step size for finite differences if `hvp_method` is "forward" or "central". Defaults to 1e-3.
- `warm_start` (bool, default: False) – If True, the conjugate gradient solver is initialized with the solution from the previous optimization step. This can accelerate convergence, especially in truncated Newton methods. Defaults to False.
- `inner` (Chainable | None, default: None) – NewtonCG will attempt to apply preconditioning to the output of this module.
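For reference, the "forward" and "central" options above correspond to the standard finite-difference formulas; a generic sketch (not the module's internal code, and `f` is assumed to be a scalar-valued function of a single tensor):
.. code-block:: python

    import torch

    def hvp_forward(f, x, v, h=1e-3):
        """Forward difference: Hv ≈ (∇f(x + h·v) − ∇f(x)) / h (one extra gradient)."""
        x0 = x.detach().requires_grad_(True)
        x1 = (x.detach() + h * v).requires_grad_(True)
        g0 = torch.autograd.grad(f(x0), x0)[0]
        g1 = torch.autograd.grad(f(x1), x1)[0]
        return (g1 - g0) / h

    def hvp_central(f, x, v, h=1e-3):
        """Central difference: Hv ≈ (∇f(x + h·v) − ∇f(x − h·v)) / (2h) (two extra gradients)."""
        xp = (x.detach() + h * v).requires_grad_(True)
        xm = (x.detach() - h * v).requires_grad_(True)
        gp = torch.autograd.grad(f(xp), xp)[0]
        gm = torch.autograd.grad(f(xm), xm)[0]
        return (gp - gm) / (2 * h)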
Examples: Newton-CG with a backtracking line search:
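A minimal sketch of this setup:
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCG(),
        tz.m.Backtracking()
    )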
Truncated Newton method (useful for large-scale problems):
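A truncated variant caps the number of CG iterations via the `maxiter` argument documented above (the value 20 is an arbitrary illustration):
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCG(maxiter=20),
        tz.m.Backtracking()
    )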
Source code in torchzero/modules/second_order/newton_cg.py
NewtonCGSteihaug ¶
Bases: torchzero.core.module.Module
Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
Notes

- In most cases NewtonCGSteihaug should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a `backward` argument (refer to documentation).
Parameters:
- `eta` (float, default: 0.0) – if the ratio of actual to predicted reduction is larger than this, the step is accepted. Defaults to 0.0.
- `nplus` (float, default: 3.5) – increase factor on successful steps. Defaults to 3.5.
- `nminus` (float, default: 0.25) – decrease factor on unsuccessful steps. Defaults to 0.25.
- `rho_good` (float, default: 0.99) – if the ratio of actual to predicted reduction is larger than this, the trust region size is multiplied by `nplus`.
- `rho_bad` (float, default: 0.0001) – if the ratio of actual to predicted reduction is less than this, the trust region size is multiplied by `nminus`.
- `init` (float, default: 1) – initial trust region value. Defaults to 1.
- `max_attempts` (int, default: 100) – maximum number of trust radius reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 100.
- `max_history` (int, default: 100) – CG will store this many intermediate solutions, reusing them when the trust radius is reduced instead of re-running CG. Each stored solution requires 2N memory. Defaults to 100.
- `boundary_tol` (float | None, default: 1e-6) – the trust region only increases when the suggested step's norm is at least `(1-boundary_tol)*trust_region`. This prevents increasing the trust region when the solution is not on the boundary. Defaults to 1e-6.
- `maxiter` (int | None, default: None) – maximum number of CG iterations per step. Each iteration requires one backward pass if `hvp_method="forward"`, two otherwise. Defaults to None.
- `miniter` (int, default: 1) – minimal number of CG iterations. This prevents making no progress when the initial guess is below tolerance. Defaults to 1.
- `tol` (float, default: 1e-8) – terminates CG when the norm of the residual is less than this value. Defaults to 1e-8.
- `reg` (float, default: 1e-8) – hessian regularization. Defaults to 1e-8.
- `solver` (str, default: "cg") – solver, "cg" or "minres". "cg" is recommended. Defaults to "cg".
- `adapt_tol` (bool, default: True) – if True, whenever the trust radius collapses to the smallest representable number, the tolerance is multiplied by 0.1. Defaults to True.
- `npc_terminate` (bool, default: False) – whether to terminate CG/MINRES whenever negative curvature is detected. Defaults to False.
- `hvp_method` (str, default: "central") – either "forward" to use a forward formula, which requires one backward pass per HVP, or "central" to use a more accurate central formula, which requires two backward passes. "forward" is usually accurate enough. Defaults to "central".
- `h` (float, default: 1e-3) – finite difference step size. Defaults to 1e-3.
- `inner` (Chainable | None, default: None) – applies preconditioning to the output of this module. Defaults to None.
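Schematically, the acceptance and radius-update parameters interact roughly as follows. This is an illustrative sketch derived only from the parameter descriptions above, not the module's actual logic:
.. code-block:: python

    def update_trust_region(radius, rho, step_norm,
                            eta=0.0, rho_good=0.99, rho_bad=1e-4,
                            nplus=3.5, nminus=0.25, boundary_tol=1e-6):
        # rho is the ratio of actual to predicted reduction for the proposed step
        accept = rho > eta
        if rho < rho_bad:
            radius *= nminus   # poor model fit: shrink the trust region
        elif rho > rho_good and step_norm >= (1 - boundary_tol) * radius:
            radius *= nplus    # good fit and the step lies on the boundary: grow
        return accept, radius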
Examples:¶
Trust-region Newton-CG:
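A minimal sketch (the trust region takes the place of a separate line search):
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.NewtonCGSteihaug(),
    )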
Reference:¶
Steihaug, Trond. "The conjugate gradient method and trust regions in large scale optimization." SIAM Journal on Numerical Analysis 20.3 (1983): 626-637.
Source code in torchzero/modules/second_order/newton_cg.py
NystromPCG ¶
Bases: torchzero.core.module.Module
Newton's method with a Nyström-preconditioned conjugate gradient solver. This tends to outperform NewtonCG but requires tuning the sketch size. An adaptive version exists in https://arxiv.org/abs/2110.02820; I might implement it too at some point.
.. note::
    This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a `backward` argument (refer to documentation).
.. note::
    In most cases NystromPCG should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.
Parameters:
- `sketch_size` (int) – size of the sketch for preconditioning; this many hessian-vector products will be evaluated before running the conjugate gradient solver. A larger value improves the preconditioning and speeds up conjugate gradient.
- `maxiter` (int | None, default: None) – maximum number of iterations. By default this is set to the number of dimensions in the objective function, which is enough for conjugate gradient to have guaranteed convergence. Setting this to a small value can still generate good enough directions. Defaults to None.
- `tol` (float, default: 1e-3) – relative tolerance for the conjugate gradient solver. Defaults to 1e-3.
- `reg` (float, default: 1e-6) – regularization parameter. Defaults to 1e-6.
- `hvp_method` (str, default: "autograd") – Determines how Hessian-vector products are evaluated. Defaults to "autograd".
    - `"autograd"`: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    - `"forward"`: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    - `"central"`: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations.
- `h` (float, default: 1e-3) – finite difference step size if `hvp_method` is "forward" or "central". Defaults to 1e-3.
- `inner` (Chainable | None, default: None) – modules to apply the hessian preconditioner to. Defaults to None.
- `seed` (int | None, default: None) – seed for the random generator. Defaults to None.
Examples:
NystromPCG with backtracking line search
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.NystromPCG(10),
        tz.m.Backtracking()
    )
Reference
Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
Source code in torchzero/modules/second_order/nystrom.py
NystromSketchAndSolve ¶
Bases: torchzero.core.module.Module
Newton's method with a Nyström sketch-and-solve solver.
.. note::
    This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a `backward` argument (refer to documentation).
.. note::
    In most cases NystromSketchAndSolve should be the first module in the chain because it relies on autograd. Use the `inner` argument if you wish to apply Newton preconditioning to another module's output.
.. note::
    If this is unstable, increase the `reg` parameter and tune the rank.
.. note::
    `tz.m.NystromPCG` usually outperforms this.
Parameters:
- `rank` (int) – size of the sketch; this many hessian-vector products will be evaluated per step.
- `reg` (float, default: 1e-3) – regularization parameter. Defaults to 1e-3.
- `hvp_method` (str, default: "autograd") – Determines how Hessian-vector products are evaluated. Defaults to "autograd".
    - `"autograd"`: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    - `"forward"`: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    - `"central"`: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations.
- `h` (float, default: 1e-3) – finite difference step size if `hvp_method` is "forward" or "central". Defaults to 1e-3.
- `inner` (Chainable | None, default: None) – modules to apply the hessian preconditioner to. Defaults to None.
- `seed` (int | None, default: None) – seed for the random generator. Defaults to None.
Examples:
NystromSketchAndSolve with backtracking line search
.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.NystromSketchAndSolve(10),
        tz.m.Backtracking()
    )
Reference
Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
Source code in torchzero/modules/second_order/nystrom.py
SixthOrder3P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Sixth-order iterative method.
Abro, Hameer Akhtar, and Muhammad Mujtaba Shaikh. "A new time-efficient and convergent nonlinear solver." Applied Mathematics and Computation 355 (2019): 516-536.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder3PM2 ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder5P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
Source code in torchzero/modules/second_order/multipoint.py
TwoPointNewton ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Two-point Newton method with frozen derivative with third-order convergence.
Sharma, Janak Raj, and Deepak Kumar. "A fast and efficient composite Newton–Chebyshev method for systems of nonlinear equations." Journal of Complexity 49 (2018): 56-73.