Experimental¶
This subpackage contains various horrible atrocities that are generally less tested. These are various ideas of mine, plus some other modules that I decided not to move to other sub-packages for whatever reason.
Classes:
- BlockPartition – splits parameters into blocks (for now, flattens them and chunks).
- CoordinateMomentum – maintains a momentum buffer; on each step, each value in the buffer has probability p of being updated with the new value.
- CubicAdam – Adam with a third momentum; each step minimizes a cubic polynomial.
- CurveBall – CurveBall method from https://arxiv.org/pdf/1805.08095#page=4.09.
- FFTProjection – projects the update into the Fourier space of real-valued inputs.
- GradMin – reformulates the objective to minimize the sum of gradient magnitudes via autograd. Not expected to be practical.
- HigherOrderNewton – a basic arbitrary-order Newton's method with optional trust region and proximal penalty.
- InfinityNormTrustRegion – trust region with the L-infinity norm via scipy.optimize.lsq_linear.
- NewtonNewton – applies Newton-like preconditioning to the Newton step.
- NewtonSolver – matrix-free Newton with any custom solver (this is for testing; use NewtonCG or NystromPCG instead).
- ReduceOutwardLR – when the update sign matches the weight sign, the learning rate for that weight is multiplied by mul.
- ScipyNewtonCG – NewtonCG with scipy solvers (any from scipy.sparse.linalg).
- SubspaceCubicAdam – runs cubic Adam in a low-rank eigenbasis.
- TensorizeProjection – flattens and concatenates all parameters into a vector, then reshapes it into a tensor.
BlockPartition ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Splits parameters into blocks (for now, flattens them and chunks the result).
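The flatten-and-chunk behaviour can be sketched in plain Python (a conceptual sketch only; the actual module operates on tensors through the projection machinery):

```python
def block_partition(params, block_size):
    """Flatten all parameter lists into one sequence, then chunk it.

    Conceptual sketch of the flatten-and-chunk behaviour described above.
    """
    flat = [v for p in params for v in p]
    # split the flat vector into consecutive blocks of block_size
    return [flat[i:i + block_size] for i in range(0, len(flat), block_size)]

blocks = block_partition([[1, 2, 3], [4, 5]], block_size=2)
```

Note the last block may be shorter when the total element count is not divisible by `block_size`.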
Source code in torchzero/modules/experimental/structural_projections.py
CoordinateMomentum ¶
Bases: torchzero.core.transform.TensorTransform
Maintains a momentum buffer; on each step, each value in the buffer has probability p of being updated with the new value.
Parameters:
- p (float, default: 0.1) – probability that each buffer value is updated with the new value. Defaults to 0.1.
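The buffer update can be illustrated with a small stdlib sketch (the real module works on tensors; this just shows the per-coordinate coin flip):

```python
import random

def coordinate_momentum_step(buffer, new_values, p, rng):
    """Each buffer entry has probability p of being replaced by the new value."""
    return [new if rng.random() < p else old
            for old, new in zip(buffer, new_values)]

rng = random.Random(0)  # fixed seed so the example is reproducible
buf = [0.0, 0.0, 0.0, 0.0]
buf = coordinate_momentum_step(buf, [1.0, 1.0, 1.0, 1.0], p=0.5, rng=rng)
```

With p close to 0 the buffer changes slowly (heavy smoothing); with p = 1 it is just the raw update.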
Source code in torchzero/modules/experimental/coordinate_momentum.py
CubicAdam ¶
Bases: torchzero.core.transform.TensorTransform
Adam with a third momentum; each step minimizes a cubic polynomial.
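The source does not spell out the update rule, but a per-coordinate cubic model built from moment estimates could be minimized in closed form along these lines (a hypothetical sketch: g, v, c stand in for debiased first, second and third moment estimates, which is an assumption, not the module's actual formula):

```python
import math

def cubic_step(g, v, c, eps=1e-8):
    """Minimize m(d) = g*d + (v/2)*d**2 + (c/6)*|d|**3 with v >= 0, c >= 0.

    Setting m'(d) = 0 on the descent branch gives a quadratic in |d|;
    the root below reduces to the Adam-like step -g/v as c -> 0.
    """
    if c < eps:                       # cubic term negligible -> Adam-like step
        return -g / (v + eps)
    t = (math.sqrt(v * v + 2.0 * c * abs(g)) - v) / c   # |d| at the minimum
    return -math.copysign(t, g)       # step opposes the gradient sign
```

The |d|^3 term guarantees the one-dimensional model is bounded below, which a plain cubic term would not.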
Source code in torchzero/modules/experimental/cubic_adam.py
CurveBall ¶
Bases: torchzero.core.transform.Transform
CurveBall method from https://arxiv.org/pdf/1805.08095#page=4.09.
For now this implementation does not include the automatic ρ, α and β hyper-parameters in closed form, so it is expected to underperform compared to the official implementation (https://github.com/jotaf98/pytorch-curveball/tree/master), which is why I moved it to experimental.
Parameters:
- precond_lr (float, default: 1e-3) – learning rate for updating preconditioned gradients. Defaults to 1e-3.
- momentum (float, default: 0.9) – decay rate for preconditioned gradients. Defaults to 0.9.
- hvp_method (str, default: "autograd") – how to calculate hessian-vector products. Defaults to "autograd".
- h (float, default: 1e-3) – finite-difference step size when hvp_method is set to finite differences. Defaults to 1e-3.
- reg (float, default: 1) – hessian regularization. Defaults to 1.
- inner (Chainable | None, default: None) – inner modules. Defaults to None.
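With fixed hyper-parameters (the variant this module implements, as opposed to the closed-form ρ/α/β of the paper), the buffer update can be sketched as follows; the function and variable names are illustrative, not the module's API:

```python
def curveball_update(z, grad, hvp, precond_lr=1e-3, momentum=0.9):
    """One CurveBall-style buffer update (sketch):
        z <- momentum * z - precond_lr * (H z + g)
    The parameters are then stepped by z."""
    delta = [hz + g for hz, g in zip(hvp(z), grad)]    # Hz + g
    return [momentum * zi - precond_lr * di for zi, di in zip(z, delta)]

# toy quadratic f(w) = w^2 / 2, so H = 1 and the HVP is the identity
z = curveball_update([0.0], grad=[1.0], hvp=lambda v: v)
```

Because the hessian only enters through an HVP, the method stays matrix-free.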
Source code in torchzero/modules/experimental/curveball.py
FFTProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Project update into Fourier space of real-valued inputs.
Parameters:
- modules (Chainable) – modules that will optimize the projected update.
- one_d (bool, default: False) – if True, uses a 1d FFT on all parameters concatenated into a vector; if False (default), uses an n-dimensional FFT on each parameter.
- norm (str, default: None) – normalization mode:
  - "forward" – normalize by 1/n.
  - "backward" – no normalization.
  - "ortho" – normalize by 1/sqrt(n) (making the FFT orthonormal).
  Calling the backward transform (torch.fft.irfft) with the same normalization mode applies an overall normalization of 1/n between the two transforms, which makes torch.fft.irfft the exact inverse. The actual torch.fft.rfft default is None, so I set it to None too; None behaves the same as "backward".
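The three normalization modes can be checked with a tiny stdlib DFT (a sketch mirroring the torch.fft convention; the combined forward/inverse scaling is always 1/n, so every mode round-trips exactly):

```python
import cmath

def dft(x, norm="backward"):
    """Naive forward DFT with torch.fft-style normalization modes."""
    n = len(x)
    out = [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    s = {"forward": 1.0 / n, "backward": 1.0, "ortho": 1.0 / n ** 0.5}[norm]
    return [v * s for v in out]

def idft(X, norm="backward"):
    """Inverse DFT; with matching norm it exactly inverts dft."""
    n = len(X)
    out = [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    s = {"forward": 1.0, "backward": 1.0 / n, "ortho": 1.0 / n ** 0.5}[norm]
    return [v * s for v in out]
```

Only "ortho" makes the transform an isometry, which matters if the inner modules are sensitive to the scale of the projected update.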
Source code in torchzero/modules/experimental/fft.py
GradMin ¶
Bases: torchzero.core.reformulation.Reformulation
Reformulates the objective to minimize sum of gradient magnitudes via autograd. This is not expected to be practical.
Parameters:
- loss_term (float, default: 0) – adds the loss value times this to the sum of gradient magnitudes.
- relative (bool, default: None) – whether to make loss_term relative to the gradient magnitude.
- graft (bool, default: None) – whether to make the loss term the same magnitude as the gradient magnitude.
- square (bool, default: False) – whether to use the sum of squared gradient magnitudes; if False, uses absolute values. Defaults to False.
- mean (bool, default: True) – whether to use the mean; if False, uses the sum. Defaults to True.
- maximize_grad (bool, default: False) – whether to maximize gradient magnitudes instead of minimizing them. Defaults to False.
- create_graph (bool, default: False) – whether to create the autograd graph. Defaults to False.
- modify_loss (bool, default: True) – whether to modify the loss value so that line searches minimize the new objective. Defaults to True.
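The reformulated objective itself is simple to write down (a sketch of the scalar it minimizes; the real module differentiates this quantity again via autograd, which is why create_graph exists):

```python
def gradmin_objective(loss, grad, loss_term=0.0, square=False, mean=True):
    """Aggregate of gradient magnitudes plus loss_term * loss."""
    mags = [g * g for g in grad] if square else [abs(g) for g in grad]
    agg = sum(mags) / len(mags) if mean else sum(mags)
    return agg + loss_term * loss

val = gradmin_objective(loss=5.0, grad=[1.0, -3.0])
```

Minimizing gradient magnitudes alone finds any stationary point, including maxima and saddles, which is one reason this is not expected to be practical without the loss_term.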
Source code in torchzero/modules/experimental/gradmin.py
HigherOrderNewton ¶
Bases: torchzero.core.module.Module
A basic arbitrary-order Newton's method with optional trust region and proximal penalty.
This constructs an nth-order Taylor approximation via autograd and minimizes it with scipy.optimize.minimize trust-region Newton solvers, with an optional proximal penalty.
The Hessian of the Taylor approximation is easier to evaluate, and it can be evaluated in batched mode, so this can be more efficient in very specific instances.
Notes
- In most cases HigherOrderNewton should be the first module in the chain because it relies on extra autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating higher-order derivatives. The closure must accept a backward argument (refer to documentation).
- This uses roughly O(N^order) memory, and solving the subproblem is very expensive.
- "none" and "proximal" trust methods may generate subproblems that have no minima, causing divergence.
Args:
order (int, optional):
Order of the method, number of taylor series terms (orders of derivatives) used to approximate the function. Defaults to 4.
trust_method (str | None, optional):
Method used for trust region.
- "bounds" - the model is minimized within bounds defined by trust region.
- "proximal" - the model is minimized with penalty for going too far from current point.
- "none" - disables trust region.
Defaults to 'bounds'.
increase (float, optional): trust region multiplier on good steps. Defaults to 1.5.
decrease (float, optional): trust region multiplier on bad steps. Defaults to 0.75.
trust_init (float | None, optional):
initial trust region size. If None, defaults to 1 when trust_method="bounds" and 0.1 when trust_method="proximal". Defaults to None.
trust_tol (float, optional):
Maximum ratio of expected loss reduction to actual reduction for trust region increase.
Should be 1 or higher. Defaults to 2.
de_iters (int | None, optional):
If this is specified, the model is minimized via differential evolution first to possibly escape local minima,
then it is passed to scipy.optimize.minimize. Defaults to None.
vectorize (bool, optional): whether to enable vectorized jacobians (usually faster). Defaults to True.
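The build-a-Taylor-model-then-minimize-in-bounds idea can be shown in one dimension (a sketch: the real module uses autograd for the derivatives and scipy trust-region solvers instead of the crude grid search below):

```python
import math

def taylor_model(derivs, x0):
    """Build m(x) from derivatives [f(x0), f'(x0), f''(x0), ...] at x0."""
    def m(x):
        return sum(d * (x - x0) ** k / math.factorial(k)
                   for k, d in enumerate(derivs))
    return m

def minimize_in_bounds(m, x0, radius, samples=10001):
    """Crude stand-in for the bounded subproblem solve: grid search."""
    xs = [x0 - radius + 2 * radius * i / (samples - 1) for i in range(samples)]
    return min(xs, key=m)

# 4th-order model of cos(x) at x0 = 0: derivatives are 1, 0, -1, 0, 1
m = taylor_model([1.0, 0.0, -1.0, 0.0, 1.0], 0.0)
step = minimize_in_bounds(m, 0.0, radius=2.0)
```

Here the unconstrained model minimum lies at |x| ≈ 2.45, outside the trust radius, so the bounded solve lands on the boundary |x| = 2, which is exactly the behaviour trust_method="bounds" describes.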
Source code in torchzero/modules/experimental/higher_order_newton.py
InfinityNormTrustRegion ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Trust region with L-infinity norm via scipy.optimize.lsq_linear.
Parameters:
- hess_module (Module | None) – a module that maintains a hessian approximation (not the hessian inverse!). This includes all full-matrix quasi-newton methods, tz.m.Newton and tz.m.GaussNewton. When using quasi-newton methods, set inverse=False when constructing them.
- eta (float, default: 0.0) – if the ratio of actual to predicted reduction is larger than this, the step is accepted. When hess_module is GaussNewton, this can be set to 0.
- nplus (float, default: 3.5) – increase factor on successful steps.
- nminus (float, default: 0.25) – decrease factor on unsuccessful steps.
- rho_good (float, default: 0.99) – if the ratio of actual to predicted reduction is larger than this, the trust region size is multiplied by nplus.
- rho_bad (float, default: 0.0001) – if the ratio of actual to predicted reduction is less than this, the trust region size is multiplied by nminus.
- init (float, default: 1) – initial trust region value. Defaults to 1.
- update_freq (int, default: 1) – frequency of updating the hessian. Defaults to 1.
- max_attempts (int, default: 10) – maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
- boundary_tol (float | None, default: None) – the trust region only increases when the suggested step's norm is at least (1 - boundary_tol) * trust_region. This prevents increasing the trust region when the solution is not on the boundary.
- tol (float | None, default: 1e-10) – tolerance for the least-squares solver.
- fallback (bool) – if True, when hess_module maintains a hessian inverse that can't be inverted efficiently, it will be inverted anyway. When False (default), a RuntimeError is raised instead.
- inner (Chainable | None, default: None) – preconditioning is applied to the output of this module. Defaults to None.
Examples:
BFGS with infinity-norm trust region
.. code-block:: python
opt = tz.Optimizer(
model.parameters(),
tz.m.InfinityNormTrustRegion(hess_module=tz.m.BFGS(inverse=False)),
)
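The accept/resize logic the eta/nplus/nminus/rho_good/rho_bad parameters describe can be sketched generically (a sketch of the standard trust-region radius update using this module's documented defaults, not the lsq_linear subproblem solve itself):

```python
def trust_region_update(trust, rho, eta=0.0, nplus=3.5, nminus=0.25,
                        rho_good=0.99, rho_bad=1e-4):
    """rho is the ratio of actual to predicted reduction."""
    accepted = rho > eta        # step accepted when reduction ratio beats eta
    if rho > rho_good:
        trust *= nplus          # model predicted the reduction well -> grow
    elif rho < rho_bad:
        trust *= nminus         # poor prediction -> shrink
    return trust, accepted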
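The accept/resize logic the eta/nplus/nminus/rho_good/rho_bad parameters describe can be sketched generically (a sketch of the standard trust-region radius update using this module's documented defaults, not the lsq_linear subproblem solve itself):

```python
def trust_region_update(trust, rho, eta=0.0, nplus=3.5, nminus=0.25,
                        rho_good=0.99, rho_bad=1e-4):
    """rho is the ratio of actual to predicted reduction."""
    accepted = rho > eta        # step accepted when reduction ratio beats eta
    if rho > rho_good:
        trust *= nplus          # model predicted the reduction well -> grow
    elif rho < rho_bad:
        trust *= nminus         # poor prediction -> shrink
    return trust, accepted
```

With the L-infinity norm the trust region is a box, which is why the subproblem maps onto scipy.optimize.lsq_linear's bound constraints.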
Source code in torchzero/modules/experimental/l_infinity.py
NewtonNewton ¶
Bases: torchzero.core.transform.Transform
Applies Newton-like preconditioning to Newton step.
This is a method that I thought of, and then it worked. Here is how it works:
1. Calculate the Newton step x by solving Hx = g.
2. Calculate the jacobian of x with respect to the parameters and call it H2.
3. Solve H2 x2 = x for x2.
4. Optionally, repeat (if order is higher than 3).
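The three steps above can be traced in one dimension (a hypothetical worked example on f(t) = t^4, with the jacobian of the step taken by finite differences rather than autograd):

```python
def newton_newton_step(theta, h=1e-5):
    """1-D NewtonNewton sketch for f(t) = t**4.

    1. Newton step: x(t) = f'(t) / f''(t) = (4 t^3) / (12 t^2) = t / 3
    2. H2 = dx/dt, here via a central finite difference
    3. solve H2 * x2 = x  ->  x2 = x / H2
    """
    def x(t):
        return (4 * t ** 3) / (12 * t ** 2)
    h2 = (x(theta + h) - x(theta - h)) / (2 * h)   # jacobian of the step
    return x(theta) / h2

theta = 1.5
theta_next = theta - newton_newton_step(theta)
```

On this function a plain Newton step only contracts theta by a factor of 1/3, while the second solve stretches the step back out and lands on the minimizer at 0 in one iteration, which gives some intuition for why the preconditioning helps.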
Source code in torchzero/modules/experimental/newtonnewton.py
NewtonSolver ¶
Bases: torchzero.core.module.Module
Matrix-free Newton with any custom solver (this is for testing; use NewtonCG or NystromPCG instead).
Source code in torchzero/modules/experimental/newton_solver.py
ReduceOutwardLR ¶
Bases: torchzero.core.transform.TensorTransform
When update sign matches weight sign, the learning rate for that weight is multiplied by mul.
This means updates that move weights towards zero have higher learning rates.
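Per coordinate the rule is just a conditional learning-rate scale (a sketch; the mul value below is an illustrative placeholder, and the real module applies this elementwise over tensors):

```python
def per_weight_lr(update, weight, lr, mul=0.5):
    """lr is multiplied by mul whenever the update sign matches the weight sign."""
    return lr * mul if update * weight > 0 else lr
```

Zero weights or updates leave the learning rate untouched, since the sign comparison is strict.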
Warning
This sounded good but after testing turns out it sucks.
Source code in torchzero/modules/experimental/reduce_outward_lr.py
ScipyNewtonCG ¶
Bases: torchzero.core.module.Module
NewtonCG with scipy solvers (any from scipy.sparse.linalg)
Source code in torchzero/modules/experimental/scipy_newton_cg.py
SubspaceCubicAdam ¶
Bases: torchzero.modules.adaptive.lre_optimizers.LREOptimizerBase
Runs cubic Adam in a low-rank eigenbasis.
Source code in torchzero/modules/experimental/cubic_adam.py
TensorizeProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Flattens and concatenates all parameters into a vector, then reshapes it into a tensor.
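The reshape behaviour can be shown with plain lists for the 2-D case (a conceptual sketch; the actual module works on torch tensors and arbitrary target shapes):

```python
def tensorize(params, shape):
    """Flatten and concatenate all parameters, then reshape to a 2-D shape."""
    flat = [v for p in params for v in p]
    rows, cols = shape
    assert rows * cols == len(flat), "shape must match total element count"
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

t = tensorize([[1, 2], [3, 4, 5, 6]], shape=(3, 2))
```

Unlike BlockPartition's chunking, the target shape here must account for every element exactly.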