Simple explanation of Quasi-Newton methods

Simple explanation of the secant equation and how quasi-newton methods use it.

Preface

At time step $t$ the parameters will be denoted as $\mathbf{x}_t$, the gradient at those parameters as $\mathbf{g}(\mathbf{x}_t)$, and the Hessian as $H(\mathbf{x}_t)$.

For this article I assume the reader knows what gradient and Hessian are, and is familiar with Newton’s method for optimization, which in the most basic form performs the following update: $$ \mathbf{x}_{t+1} \leftarrow \mathbf{x}_t - H(\mathbf{x}_t)^{-1}\mathbf{g}(\mathbf{x}_t) $$

Introduction

Quasi-Newton methods approximate the Hessian or the Hessian inverse using only gradients. Let’s see how they do that!

Secant equation

The secant equation is a core concept in Quasi-Newton methods.

Suppose we have a Hessian $H$ and some vector $\mathbf{s}$, and we compute the Hessian-vector product and call it $\mathbf{y}$: $$ H\mathbf{s} = \mathbf{y} $$ This is the secant equation. If we had $\mathbf{s}$ and $\mathbf{y}$, we would know that the Hessian must satisfy $H\mathbf{s} = \mathbf{y}$.

It might seem that obtaining $\mathbf{y}$ requires the Hessian in order to compute $H\mathbf{s}$. But there is a simple way to get $\mathbf{s}$ and $\mathbf{y}$ basically for free, without knowing the Hessian.

Finite-difference Hessian-vector products

If we compute the difference between consecutive parameters and assign it to $\mathbf{s}_t$, and assign the difference between consecutive gradients to $\mathbf{y}_t$, then $\mathbf{y}_t$ approximates the Hessian-vector product with $\mathbf{s}_t$: $$ \mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1} $$ $$ \mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) $$ $$ \text{then: } H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t $$

To understand why, we can look at the formula for a finite-difference approximation to a Hessian-vector product.

A Hessian-vector product, such as $H(\mathbf{x})\mathbf{s}$, can be approximated using this (backward finite difference) formula: $$ H(\mathbf{x})\mathbf{s} \approx \frac{\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{x} - r\mathbf{s})}{r} $$

Here $\mathbf{x}$ is the point at which the Hessian-vector product is approximated, and $r$ is a small constant. With infinite precision the approximation becomes exact in the limit $r \rightarrow 0$.

Now let’s set $r$ to 1 and replace $\mathbf{s}$ with $\mathbf{x}_t - \mathbf{x}_{t-1}$: $$ H(\mathbf{x}_t) \mathbf{s}_t \approx \frac{\mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_t - 1 (\mathbf{x}_t - \mathbf{x}_{t-1}))}{1} = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) = \mathbf{y}_t $$ so: $$ H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t $$

Setting $r$ to 1 might make the approximation seem crude, but in practice it works exceptionally well in Quasi-Newton methods (and for quadratic functions, where the gradient is linear, it is exact for any $r$).
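To make this concrete, here is a minimal NumPy sketch checking that $\mathbf{y}_t \approx H(\mathbf{x}_t)\mathbf{s}_t$; the toy function and all names are my own, chosen purely for illustration:

```python
import numpy as np

# toy function f(x) = 0.5 x^T A x + 0.25 * sum(x^4), with analytic gradient and Hessian
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(x):
    return A @ x + x**3

def hess(x):
    return A + np.diag(3.0 * x**2)

x_prev = np.array([1.0, -0.5])
x_curr = x_prev - 0.01 * grad(x_prev)   # take a small gradient step

s = x_curr - x_prev                      # difference of consecutive parameters
y = grad(x_curr) - grad(x_prev)          # difference of consecutive gradients

print(y)                  # finite-difference Hessian-vector product
print(hess(x_curr) @ s)   # exact Hessian-vector product; nearly identical
```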

Least-change principle

Quasi-Newton methods maintain a Hessian approximation $B$ (or its inverse) and refine it on each iteration.

Suppose we have the current Hessian approximation $B_{t-1}$, and we obtain $\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$ and $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$. We know that $H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t$, therefore we want our new refined Hessian approximation $B_t$ to satisfy $B_t \mathbf{s}_t = \mathbf{y}_t$. However, on its own this isn’t too useful, as there are infinitely many possible $B_t$ that satisfy $B_t \mathbf{s}_t = \mathbf{y}_t$.

The idea of the least-change principle is to pick $B_t$ that is as close to $B_{t-1}$ as possible. Since $B_{t-1}$ may have accumulated a lot of curvature information from previous iterations, we don’t want to throw that information away: keeping $B_t$ close to $B_{t-1}$ ensures that as little of that information as possible is lost. “Close” can be interpreted in many different ways, leading to different Quasi-Newton formulas.

The least-change principle is a heuristic (to my knowledge), but it works exceptionally well in practice, and the resulting approximations converge to the true Hessian under suitable conditions. There are multiple interpretations of why it works, for example:

  • Lukšan, L., & Spedicato, E. (2000). Variable metric methods for unconstrained optimization and nonlinear least squares. Journal of Computational and Applied Mathematics, 124(1-2), 61-95.:

    Roughly speaking, the least-change principle guarantees that as much information from previous iterations as possible is saved while the quasi-Newton condition brings new information because it is satisfied by [the new] matrix.

  • Greenstadt, J. (1970). Variations on variable-metric methods.(With discussion). Mathematics of Computation, 24(109), 1-22.:

    Let us ask for the “best” correction in some sense. There are many possible choices to make, but a good one is to ask for the smallest correction, in the sense of some norm. To a certain extent, this would tend to keep the elements of [$B^{−1}_k$ ] from growing too large, which might cause an undesirable instability.

Deriving a Quasi-Newton method

Now let’s derive a Quasi-Newton method. We will derive the Symmetric Rank 1 (SR1) method because it is simple.

What we have:

  • current Hessian approximation $B_{t-1}$;
  • $\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$
  • $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$;

What we want:

  • $B_t$ that satisfies the secant equation $B_t\mathbf{s}_t=\mathbf{y}_t$
  • we want $B_t$ to be close to $B_{t-1}$ in some sense, so that little of the information in $B_{t-1}$ is lost
  • there are other useful constraints one can impose on $B_t$:
    • usually $B_t$ is constrained to be symmetric: $B_t=B_t^{\text{T}}$.
    • $B_t$ is often constrained to be positive-definite, however the SR1 formula does not do that.

A simple way to keep $B_t$ symmetric and not too far from $B_{t-1}$ is to perform a symmetric rank-1 update: $B_t = B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}}$, where $\alpha \mathbf{u}\mathbf{u}^{\text{T}}$ is a symmetric rank-1 correction which we can try to derive a formula for.

Let’s plug that into the secant equation $B_t\mathbf{s}_t = \mathbf{y}_t$, replacing $B_t$ with $B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}}$. The entire derivation follows; it turns out there is a single unique solution for the correction.

$$(B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}})\mathbf{s}_t = \mathbf{y}_t$$

$$ \begin{gather*} \text{(1) unpack brackets:} \\ \mathbf{y}_t = B_{t-1}\mathbf{s}_t + \alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t \end{gather*} $$

$$ \begin{gather*} \text{(2) rearrange:}\\ \alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t = \mathbf{y}_t - B_{t-1}\mathbf{s}_t \end{gather*} $$

$$ \begin{gather*} \text{(3) transpose both sides:}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = \alpha \mathbf{s}_t^{\text{T}}\mathbf{u}\mathbf{u}^{\text{T}} \end{gather*} $$

$$ \begin{gather*} \text{(4) multiply (2) by (3):}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = (\alpha \mathbf{u} \mathbf{u}^{\text{T}} \mathbf{s}_t) (\alpha \mathbf{s}_t^{\text{T}} \mathbf{u} \mathbf{u}^{\text{T}}) \end{gather*} $$

$$ \begin{gather*} \text{(5) rearrange right-hand side:}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = \alpha^2 \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t) \cdot (\mathbf{s}_t^{\text{T}}\mathbf{u}) \cdot \mathbf{u} \mathbf{u}^{\text{T}} = \alpha^2 (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 \mathbf{u} \mathbf{u}^{\text{T}} \end{gather*} $$

$$ \begin{gather*} \text{(6) divide by } \alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2:\\ \alpha \mathbf{u} \mathbf{u}^{\text{T}} = \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2} \end{gather*} $$

The second part of the derivation gets rid of the unknown $\alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2$ in the denominator. Start again from the secant equation:

$$(B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}})\mathbf{s}_t = \mathbf{y}_t$$

$$ \begin{gather*} \text{(7) left-multiply by } {\mathbf{s}_t^{\text{T}}}:\\ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \mathbf{s}_t^{\text{T}} (\alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t) \end{gather*} $$

$$ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \alpha(\mathbf{s}_t^{\text{T}}\mathbf{u})(\mathbf{u}^{\text{T}} \mathbf{s}_t) $$

$$ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \alpha(\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 $$

$$ \begin{gather*}\text{(8) rearrange:}\\ \alpha(\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 = \mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t) \end{gather*} $$

Now we substitute (8) into the denominator of (6) and get the SR1 correction formula: $$ \alpha \mathbf{u} \mathbf{u}^{\text{T}} = \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)} $$

So, given $B_{t-1}$, $\mathbf{s}_t$ and $\mathbf{y}_t$, the Hessian approximation is updated like this: $$ B_t \leftarrow B_{t-1} + \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)} $$
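As a quick sketch, this update is a few lines of NumPy. The skip rule for a near-zero denominator is my addition (a standard SR1 safeguard); the derivation above simply assumes the denominator is nonzero:

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    # residual of the secant equation under the current approximation
    r = y - B @ s
    denom = r @ s  # s^T (y - B s)
    # skip the update when the denominator is tiny (standard SR1 safeguard)
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom
```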

We can also do the same for the Hessian inverse: $B_t\mathbf{s}_t=\mathbf{y}_t$ implies $B_t^{-1}\mathbf{y}_t=\mathbf{s}_t$. We can repeat the entire derivation above, simply swapping $\mathbf{s}_t$ and $\mathbf{y}_t$, and we get a formula that updates the Hessian inverse approximation: $$ B_t^{-1} \leftarrow B_{t-1}^{-1} + \frac{(\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)(\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)^{\text{T}}}{\mathbf{y}_t^{\text{T}} (\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)} $$

Note that swapping $\mathbf{s}_t$ and $\mathbf{y}_t$ works for SR1, but it doesn’t work for all formulas. For example, the BFGS and DFP methods are duals: if you swap $\mathbf{s}_t$ and $\mathbf{y}_t$ in the BFGS formula for the Hessian, you don’t get the BFGS formula for the inverse Hessian; instead you get the DFP formula for the inverse Hessian, and vice versa.

Final algorithm

Now that we have a formula for updating the Hessian or Hessian inverse approximation, let’s see how it is applied to minimize a function.

Initialization

First we need to initialize the Hessian approximation $B_0$. Usually it is initialized to a scaled identity matrix. It can also be initialized to the true Hessian matrix or some guess of it.

A heuristic for choosing the scale of the initial identity matrix is described on p. 142 of Nocedal, J., & Wright, S. J. (1999). Numerical Optimization. Springer. It is the following: $$ B_1 = I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1} $$ and for the Hessian inverse: $$ B_1^{-1} = I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{s}_1}{\mathbf{y}_1^{\text{T}} \mathbf{y}_1} $$

But this formula needs $\mathbf{s}$ and $\mathbf{y}$, the differences between consecutive parameters and gradients, which are not available on the first step since we don’t yet have previous parameters and gradients. So on the first step we set $B_0=I$, meaning the first step is just a gradient descent step; on the second step we obtain our first $\mathbf{s}_1$ and $\mathbf{y}_1$ and set $B_1=I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1}$.

This heuristic is actually the Barzilai–Borwein step size: the Barzilai–Borwein method uses $I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1}$ as its Hessian approximation. This scaling corresponds to the least-squares solution of $B^{-1}\mathbf{y}=\mathbf{s}$ where $B^{-1}$ is constrained to be a scaled identity; that solution is unique, so the least-change principle can’t be used here.

Performing optimization

There are multiple ways to perform optimization with a Quasi-Newton Hessian approximation: it can be used with a line search or with a trust region. Using a fixed step size is less common, as it isn’t stable enough. I must note that the SR1 method we derived above tends to be unstable with a line search and is much more suitable for a trust region. The most popular Quasi-Newton formula, BFGS, works well with both. Here we will review the line-search approach because it is simpler.

The algorithm for most Quasi-Newton methods is very similar. We have $B_{t-1}$, the current Hessian approximation, or $B_{t-1}^{-1}$, the current Hessian inverse approximation. On time step $t$ the algorithm is the following:

$1.$ Compute $\mathbf{s}_t$, the difference between the current and previous parameters:

$$\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$$

$2.$ Compute $\mathbf{y}_t$, the difference between the current and previous gradients:

$$\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$$

$3.$ Update the Hessian approximation (or the Hessian inverse approximation) using some Quasi-Newton formula. Maintaining a Hessian inverse approximation is more efficient, since we will need to compute $B_t^{-1} \mathbf{g}(\mathbf{x}_t)$. We could use the SR1 formula derived above, but since it doesn’t work well with line searches, let’s use the BFGS formula:

$$ B_{t}^{-1} \leftarrow B_{t-1}^{-1} + \frac{(\mathbf{s}_t^T \mathbf{y}_t + \mathbf{y}_t^T B_{t-1}^{-1} \mathbf{y}_t) \mathbf{s}_t \mathbf{s}_t^T}{(\mathbf{s}_t^T \mathbf{y}_t)^2} - \frac{B_{t-1}^{-1} \mathbf{y}_t \mathbf{s}_t^T + \mathbf{s}_t \mathbf{y}_t^T B_{t-1}^{-1}}{\mathbf{s}_t^T \mathbf{y}_t} $$

$4.$ Determine the step size $\gamma_t$, usually via a line search along the Quasi-Newton direction $-B_t^{-1} \mathbf{g}(\mathbf{x}_t)$:

$$\gamma_t = \underset{\gamma}{\text{argmin}}\; f(\mathbf{x}_t - \gamma B_t^{-1} \mathbf{g}(\mathbf{x}_t))$$

$5.$ Update the parameters:

$$\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t - \gamma_t B_t^{-1} \mathbf{g}(\mathbf{x}_t)$$
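Putting the five steps together, here is a minimal sketch of the whole loop in NumPy. The backtracking (Armijo) line search is a simplification I am assuming in place of a proper Wolfe line search, and all names are illustrative:

```python
import numpy as np

def bfgs_minimize(f, grad, x0, iters=100, tol=1e-8):
    x = x0.astype(float).copy()
    g = grad(x)
    B_inv = np.eye(x.size)  # B_0^{-1} = I, so the first step is gradient descent
    for t in range(iters):
        d = -B_inv @ g  # quasi-Newton direction
        # step 4: backtracking (Armijo) line search for the step size gamma
        gamma, f_x = 1.0, f(x)
        while f(x + gamma * d) > f_x + 1e-4 * gamma * (g @ d) and gamma > 1e-12:
            gamma *= 0.5
        x_new = x + gamma * d                # step 5: parameter update
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g          # steps 1-2: parameter/gradient differences
        sy = s @ y
        if t == 0 and sy > 0:
            B_inv *= sy / (y @ y)            # Barzilai-Borwein scaling of the identity
        if sy > 1e-10:                       # curvature check; skip unsafe updates
            Biy = B_inv @ y                  # step 3: BFGS inverse-Hessian update
            B_inv += ((sy + y @ Biy) / sy**2) * np.outer(s, s) \
                   - (np.outer(Biy, s) + np.outer(s, Biy)) / sy
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# usage: minimize the Rosenbrock function (minimum at [1, 1])
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                                 200 * (x[1] - x[0]**2)])
print(bfgs_minimize(rosen, rosen_grad, np.array([-1.2, 1.0])))
```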

Further developments

I will now briefly outline some important modifications to Quasi-Newton (QN) methods. This is not an exhaustive list and I might update it in the future.

Limited-memory

Limited-memory variants are probably the most important and most widely used modifications of QN methods. For a problem with $n$ variables, standard QN uses at least $n^2$ memory to store $B$ or $B^{-1}$, so it can’t be applied to large-scale problems.

Limited-memory variants store $B$ more efficiently and usually only consider a limited history of past $\mathbf{s}_t$ and $\mathbf{y}_t$ pairs.

Let’s say we’ve initialized $B$ to a scaled identity and performed $k$ updates with the BFGS formula, where each update is a rank-2 correction. After $k$ rank-2 updates, $B$ becomes a scaled identity plus a symmetric matrix of rank at most $2k$. This can be represented as $B = \alpha I + U W U^{\text{T}}$, where $U$ is an $n \times 2k$ matrix and $W$ is a small $2k \times 2k$ matrix. This is called the compact representation, and it only requires about $n \times 2k$ memory.
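A sketch of why this pays off (the function name is mine, and the exact contents of $W$ depend on the formula, so $U$ and $W$ are taken as given here): multiplying $B$ by a vector costs $O(nk)$ instead of $O(n^2)$, and $B$ is never formed explicitly:

```python
import numpy as np

def compact_matvec(alpha, U, W, v):
    # B v where B = alpha*I + U W U^T, U is n x 2k, W is 2k x 2k.
    # Never forms the n x n matrix B; costs O(n*k) time and memory.
    return alpha * v + U @ (W @ (U.T @ v))
```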

Alternatively, it is possible to store the past $k$ pairs of $\mathbf{s}_t$ and $\mathbf{y}_t$ and compute $B^{-1}\mathbf{g}(\mathbf{x}_t)$ via the two-loop recursion, never explicitly forming $B^{-1}$; this also requires only $n \times 2k$ memory. This is used, for example, by the L-BFGS method, which is probably the most widely used QN method to date.
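Here is a minimal sketch of the two-loop recursion (names are mine; the history lists are assumed to hold the most recent $k$ pairs, oldest first), computing $B^{-1}\mathbf{g}$ directly from the stored pairs:

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    # two-loop recursion: returns (approximately) B^{-1} g without forming B^{-1}
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):  # newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    # scale by s^T y / y^T y, the initialization heuristic from earlier
    q *= (s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1])
    for s, y, a in zip(s_hist, y_hist, reversed(alphas)):  # oldest to newest
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return q  # negate (and scale by a step size) to get the descent step
```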

Diagonal QN

Diagonal QN methods store $B$ as a diagonal matrix, therefore they require only $n$ memory and no matrix multiplications or linear solves.

Diagonal QN methods can’t use the secant equation directly. If we take the secant equation $B\mathbf{s}=\mathbf{y}$ and constrain $B$ to be diagonal, the only solution is $B = \text{diag}(\mathbf{y}/\mathbf{s})$, i.e. the element-wise division of $\mathbf{y}$ by $\mathbf{s}$ placed on the diagonal. This is extremely unstable and basically doesn’t work.
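A two-line illustration with made-up numbers: whenever a component of $\mathbf{s}$ is near zero, the corresponding entry of $\mathbf{y}/\mathbf{s}$ blows up:

```python
import numpy as np

s = np.array([1e-1, 1e-9, 2e-1])  # made-up values; the middle component is tiny
y = np.array([3e-1, 5e-2, 4e-1])
print(y / s)  # [3.0e0, 5.0e7, 2.0e0]: the middle "curvature" estimate is garbage
```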

Instead, diagonal QN methods use other conditions, such as the weak secant equation (also known as the quasi-Cauchy relation) and its variants.

Current diagonal QN methods haven’t seen widespread adoption, probably because they converge more slowly than limited-memory methods, but they are also even cheaper, so they might have some use cases.

Randomized/Greedy QN

QN methods take $\mathbf{s}_t$ to be the difference between consecutive parameters so that we have easy access to the approximation $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) \approx H(\mathbf{x}_t)\mathbf{s}_t$, but other choices of $\mathbf{s}_t$ are possible. The main disadvantage of this approximation is that it depends on the difference between consecutive gradients, which is too noisy when the objective function is stochastic. A simple alternative is to sample a random (or some other) vector $\mathbf{s}$ and explicitly compute $\mathbf{y}$ as the Hessian-vector product with $\mathbf{s}$, either via the finite-difference formula or via Pearlmutter’s trick. This is used, for example, in the Stochastic Quasi-Newton method (SQN).
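A sketch of this alternative (function names are mine), using the backward finite-difference formula from earlier with a small $r$ and a random $\mathbf{s}$:

```python
import numpy as np

def hvp_fd(grad, x, s, r=1e-6):
    # backward finite-difference Hessian-vector product H(x) s
    return (grad(x) - grad(x - r * s)) / r

def sample_pair(grad, x):
    # sample a random direction and get the corresponding curvature pair (s, y)
    s = np.random.randn(x.size)
    return s, hvp_fd(grad, x, s)
```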

Moreover, it is possible to choose $\mathbf{s}$ such that the Hessian-vector product with it gives us as much information about the Hessian as possible; doing so is called a greedy Quasi-Newton update.

Notes

For some diabolical reason, in the Quasi-Newton literature the Hessian inverse approximation is often denoted as $H$. To avoid confusing it with the Hessian, I wrote it as $B^{-1}$. This is also not perfect, but it is used in some QN literature too, and I think it’s less confusing.

This post is licensed under CC BY 4.0 by the author.