Simple explanation of Quasi-Newton methods
Simple explanation of the secant equation and how Quasi-Newton methods use it.
Preface
At time step $t$ the parameters will be denoted as $\mathbf{x}_t$, the gradient at those parameters as $\mathbf{g}(\mathbf{x}_t)$, and the Hessian as $H(\mathbf{x}_t)$.
For this article I assume the reader knows what the gradient and the Hessian are, and is familiar with Newton’s method for optimization, which in its most basic form performs the following update: $$ \mathbf{x}_{t+1} \leftarrow \mathbf{x}_t - H(\mathbf{x}_t)^{-1}\mathbf{g}(\mathbf{x}_t) $$
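To make the update concrete, here is a minimal sketch in Python on a toy quadratic (the function and all names here are my own illustration, not a reference implementation):

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so g(x) = Ax - b and H(x) = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic

x = np.zeros(2)
for t in range(5):
    # Newton update: x <- x - H(x)^{-1} g(x), computed via a linear solve
    # instead of an explicit matrix inverse.
    x = x - np.linalg.solve(hessian(x), grad(x))

print(x)  # for a quadratic, Newton's method lands on A^{-1} b in one step
```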
Introduction
Quasi-Newton methods approximate the Hessian or its inverse using only gradients. Let’s see how they do that!
Secant equation
The secant equation is a core concept in Quasi-Newton methods.
Suppose we have a Hessian $H$ and some vector $\mathbf{s}$, and we compute the Hessian-vector product with it, calling the result $\mathbf{y}$. So we get: $$ H\mathbf{s} = \mathbf{y} $$ This is the secant equation. If we had $\mathbf{s}$ and $\mathbf{y}$, we would know that the Hessian must satisfy $H\mathbf{s} = \mathbf{y}$.
It might seem that in order to obtain $\mathbf{y}$, one needs the Hessian to compute $H\mathbf{s}$. But there is a simple way to get $\mathbf{s}$ and $\mathbf{y}$ basically for free without knowing the Hessian.
Finite-difference Hessian-vector products
If we compute the difference between consecutive parameters and assign it to $\mathbf{s}_t$, and assign the difference between consecutive gradients to $\mathbf{y}_t$, then $\mathbf{y}_t$ approximates the Hessian-vector product with $\mathbf{s}_t$: $$ \mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1} $$ $$ \mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) $$ $$ \text{then: } H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t $$
To understand why, we can look at the formula for a finite-difference approximation to a Hessian-vector product.
A Hessian-vector product, such as $H(\mathbf{x})\mathbf{s}$, can be approximated using this (backward finite difference) formula: $$ H(\mathbf{x})\mathbf{s} \approx \frac{\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{x} - r\mathbf{s})}{r} $$
Here $\mathbf{x}$ is the point at which the Hessian-vector product is approximated, and $r$ is a small constant. With infinite precision the approximation becomes exact in the limit $r \rightarrow 0$.
Now let’s set $r$ to 1 and replace $\mathbf{s}$ with $\mathbf{x}_t - \mathbf{x}_{t-1}$: $$ H(\mathbf{x}_t) \mathbf{s}_t \approx \frac{\mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_t - 1 (\mathbf{x}_t - \mathbf{x}_{t-1}))}{1} = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) = \mathbf{y}_t $$ so: $$ H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t $$
Setting $r$ to 1 might seem like it would make the approximation crude, but in practice it works exceptionally well in Quasi-Newton methods.
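As a quick sanity check, here is a small numerical experiment (my own sketch, assuming a toy function $f(\mathbf{x}) = \sum_i x_i^4$) comparing the exact Hessian-vector product with its finite-difference approximation:

```python
import numpy as np

# f(x) = sum(x_i^4): gradient is 4x^3 element-wise, Hessian is diag(12x^2).
def grad(x):
    return 4.0 * x**3

def hvp_exact(x, s):
    return 12.0 * x**2 * s  # diag(12x^2) @ s

def hvp_finite_diff(x, s, r):
    # Backward finite difference: (g(x) - g(x - r s)) / r
    return (grad(x) - grad(x - r * s)) / r

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
s = rng.standard_normal(5)

print(hvp_exact(x, s))
print(hvp_finite_diff(x, s, r=1e-4))  # very close to the exact product
print(hvp_finite_diff(x, s, r=1.0))   # cruder, but this is what QN methods use
```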
Least-change principle
Quasi-Newton methods maintain a Hessian approximation $B$ (or its inverse) and refine it on each iteration.
Suppose we have the current Hessian approximation $B_{t-1}$, and we obtain $\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$ and $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$. We know that $H(\mathbf{x}_t) \mathbf{s}_t \approx \mathbf{y}_t$, therefore we want our new refined Hessian approximation $B_t$ to satisfy $B_t \mathbf{s}_t = \mathbf{y}_t$. However, on its own this isn’t too useful, as there are infinitely many possible $B_t$ that satisfy $B_t \mathbf{s}_t = \mathbf{y}_t$.
The idea of the least-change principle is to pick $B_t$ such that it is close to $B_{t-1}$, preferably as close as possible. Since $B_{t-1}$ might’ve accumulated a lot of curvature information from previous iterations, we don’t want to throw that information away - keeping $B_t$ close to $B_{t-1}$ ensures that as little of that information as possible is lost. “Close” can be interpreted in many different ways, leading to different Quasi-Newton formulas.
The least-change principle is a heuristic (to my knowledge), but it works exceptionally well in practice, and the resulting approximations converge to the true Hessian. There are multiple interpretations of why it works, for example:
Lukšan, L., & Spedicato, E. (2000). Variable metric methods for unconstrained optimization and nonlinear least squares. Journal of Computational and Applied Mathematics, 124(1-2), 61-95:

> Roughly speaking, the least-change principle guarantees that as much information from previous iterations as possible is saved while the quasi-Newton condition brings new information because it is satisfied by matrix.

> Let us ask for the “best” correction in some sense. There are many possible choices to make, but a good one is to ask for the smallest correction, in the sense of some norm. To a certain extent, this would tend to keep the elements of [$B^{−1}_k$] from growing too large, which might cause an undesirable instability.
Deriving a Quasi-Newton method
Now let’s derive a Quasi-Newton method. We will derive the Symmetric Rank 1 (SR1) method because it is simple.
What we have:
- current Hessian approximation $B_{t-1}$;
- $\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$
- $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$;
What we want:
- $B_t$ that satisfies the secant equation $B_t\mathbf{s}_t=\mathbf{y}_t$
- we want $B_t$ to be close to $B_{t-1}$ in some sense, so that little information from $B_{t-1}$ is lost
- there are other useful constraints one can impose on $B_t$:
  - usually $B_t$ is constrained to be symmetric: $B_t=B_t^{\text{T}}$;
  - $B_t$ is often constrained to be positive-definite; the SR1 formula, however, does not enforce this.
A simple way to keep $B_t$ symmetric and not too far from $B_{t-1}$ is to perform a symmetric rank-1 update: $B_t = B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}}$, where $\alpha \mathbf{u}\mathbf{u}^{\text{T}}$ is a symmetric rank-1 correction which we can try to derive a formula for.
Let’s plug that into the secant equation $B_t\mathbf{s}_t = \mathbf{y}_t$, replacing $B_t$ with $B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}}$. I give the entire derivation here; it turns out there is a single unique solution.
$$(B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}})\mathbf{s}_t = \mathbf{y}_t$$
$$ \begin{gather*} \text{(1) unpack brackets:} \\ \mathbf{y}_t = B_{t-1}\mathbf{s}_t + \alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t \end{gather*} $$
$$ \begin{gather*} \text{(2) rearrange:}\\ \alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t = \mathbf{y}_t - B_{t-1}\mathbf{s}_t \end{gather*} $$
$$ \begin{gather*} \text{(3) transpose both sides:}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = \alpha \mathbf{s}_t^{\text{T}}\mathbf{u}\mathbf{u}^{\text{T}} \end{gather*} $$
$$ \begin{gather*} \text{(4) multiply (2) by (3):}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = (\alpha \mathbf{u} \mathbf{u}^{\text{T}} \mathbf{s}_t) (\alpha \mathbf{s}_t^{\text{T}} \mathbf{u} \mathbf{u}^{\text{T}}) \end{gather*} $$
$$ \begin{gather*} \text{(5) rearrange right-hand side:}\\ (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}} = \alpha^2 \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t) \cdot (\mathbf{s_t}^{\text{T}}\mathbf{u}) \cdot \mathbf{u} \mathbf{u}^{\text{T}} = \alpha^2 (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 \mathbf{u} \mathbf{u}^{\text{T}} \end{gather*} $$
$$ \begin{gather*} \text{(6) divide by } \alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2:\\ \alpha \mathbf{u} \mathbf{u}^{\text{T}} = \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2} \end{gather*} $$
The second part of the derivation is to get rid of $\alpha \cdot (\mathbf{u}^{\text{T}} \mathbf{s}_t)^2$ in the denominator:
$$(B_{t-1} + \alpha\mathbf{u}\mathbf{u}^{\text{T}})\mathbf{s}_t = \mathbf{y}_t$$
$$ \begin{gather*} \text{(7) left-multiply by } {\mathbf{s}_t^{\text{T}}}:\\ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \mathbf{s}_t^{\text{T}} (\alpha\mathbf{u}\mathbf{u}^{\text{T}} \mathbf{s}_t) \end{gather*} $$
$$ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \alpha(\mathbf{s}_t^{\text{T}}\mathbf{u})(\mathbf{u}^{\text{T}} \mathbf{s}_t) $$
$$ \mathbf{s}_t^{\text{T}}\mathbf{y}_t = \mathbf{s}_t^{\text{T}} B_{t-1} \mathbf{s}_t + \alpha(\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 $$
$$ \begin{gather*}\text{(8) rearrange:}\\ \alpha(\mathbf{u}^{\text{T}} \mathbf{s}_t)^2 = \mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t) \end{gather*} $$
Now we put (8) into (6) and we get the SR1 correction formula: $$ \alpha \mathbf{u} \mathbf{u}^{\text{T}} = \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)} $$
So, given $B_{t-1}$, $\mathbf{s}_t$ and $\mathbf{y}_t$, the Hessian approximation is updated like this: $$ B_t \leftarrow B_{t-1} + \frac{(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)(\mathbf{y}_t - B_{t-1}\mathbf{s}_t)^{\text{T}}}{\mathbf{s}_t^{\text{T}} (\mathbf{y}_t - B_{t-1}\mathbf{s}_t)} $$
We can also do the same for the Hessian inverse. Since $B_t\mathbf{s}_t=\mathbf{y}_t$, we have $B_t^{-1}\mathbf{y}_t=\mathbf{s}_t$. We can repeat the entire derivation above, simply swapping $\mathbf{s}_t$ and $\mathbf{y}_t$, and we get a formula which updates the Hessian inverse approximation: $$ B_t^{-1} \leftarrow B_{t-1}^{-1} + \frac{(\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)(\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)^{\text{T}}}{\mathbf{y}_t^{\text{T}} (\mathbf{s}_t - B_{t-1}^{-1}\mathbf{y}_t)} $$
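Here is what both updates look like in code (a sketch; the safeguard that skips the update when the denominator is tiny is a standard practical addition, not part of the derivation above):

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of the Hessian approximation B so that B_new @ s = y."""
    v = y - B @ s                  # the vector y - B s from the derivation
    denom = v @ s                  # s^T (y - B s)
    # Skip the update when the denominator is near zero: the SR1 correction
    # is undefined / unstable there.
    if abs(denom) < eps * np.linalg.norm(s) * np.linalg.norm(v):
        return B
    return B + np.outer(v, v) / denom

def sr1_update_inverse(B_inv, s, y, eps=1e-8):
    """The same update with s and y swapped, for the inverse approximation."""
    v = s - B_inv @ y
    denom = v @ y
    if abs(denom) < eps * np.linalg.norm(y) * np.linalg.norm(v):
        return B_inv
    return B_inv + np.outer(v, v) / denom
```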
Note that swapping $\mathbf{s}_t$ and $\mathbf{y}_t$ works for SR1, but it doesn’t work for all formulas. For example, the BFGS and DFP methods are duals: if you swap $\mathbf{s}_t$ and $\mathbf{y}_t$ in the BFGS formula for the Hessian, you don’t get the BFGS formula for the inverse Hessian; instead you get the DFP formula for the inverse Hessian, and vice versa.
Final algorithm
Now that we have a formula for updating the Hessian or the Hessian inverse approximation, let’s see how it is applied to minimize a function.
Initialization
First we need to initialize the Hessian approximation $B_0$. Usually it is initialized to a scaled identity matrix. It can also be initialized to the true Hessian matrix or some guess of it.
A heuristic for choosing the scale of the initial identity matrix is described on p. 142 of Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Springer, and it is the following: $$ B_1 = I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1} $$ and for the Hessian inverse: $$ B_1^{-1} = I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{s}_1}{\mathbf{y}_1^{\text{T}} \mathbf{y}_1} $$
But for this formula we need $\mathbf{s}$ and $\mathbf{y}$ - the differences between consecutive parameters and gradients - which are not available on the first step, since we don’t yet have previous parameters and gradients. So on the first step we set $B_0=I$, meaning the first step is just a gradient descent step; on the second step we obtain our first $\mathbf{s}_1$ and $\mathbf{y}_1$ and set $B_1=I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1}$.
This heuristic is actually the Barzilai–Borwein step size: the Barzilai–Borwein method uses $I \cdot \frac{\mathbf{y}_1^{\text{T}}\mathbf{y}_1}{\mathbf{y}_1^{\text{T}} \mathbf{s}_1}$ as its Hessian approximation. This scaling corresponds to the least-squares solution of $B^{-1}\mathbf{y}=\mathbf{s}$ where $B^{-1}$ is constrained to be a scaled identity; that solution is unique, so the least-change principle can’t be used here.
Performing optimization
There are multiple ways to perform optimization with a Quasi-Newton Hessian approximation: it can be used with a line search or a trust region. Using a fixed step size is less common, as it isn’t stable enough. I must note that the SR1 formula we derived above tends to be unstable with a line search and is much more suitable for a trust region. The most popular Quasi-Newton formula, BFGS, works well with both line search and trust region. Here we will review the line search approach because it is simpler.
The algorithm for most Quasi-Newton methods is very similar. We maintain $B_{t-1}$, the current Hessian approximation, or $B_{t-1}^{-1}$, the current Hessian inverse approximation. On time step $t$ the algorithm is the following:
$1.$ compute $\mathbf{s}_t$ - difference between current and previous parameters:
$$\mathbf{s}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$$
$2.$ compute $\mathbf{y}_t$ - difference between current and previous gradients:
$$\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1})$$
$3.$ Update the Hessian approximation (or the Hessian inverse approximation) using some Quasi-Newton formula. Maintaining a Hessian inverse approximation is more efficient, since we will need to compute $B_t^{-1} \mathbf{g}(\mathbf{x}_t)$. We could use the SR1 formula derived above, but since it doesn’t work well with line searches, let’s use the BFGS formula:
$$ B_{t}^{-1} \leftarrow B_{t-1}^{-1} + \frac{(\mathbf{s}_t^{\text{T}} \mathbf{y}_t + \mathbf{y}_t^{\text{T}} B_{t-1}^{-1} \mathbf{y}_t) \mathbf{s}_t \mathbf{s}_t^{\text{T}}}{(\mathbf{s}_t^{\text{T}} \mathbf{y}_t)^2} - \frac{B_{t-1}^{-1} \mathbf{y}_t \mathbf{s}_t^{\text{T}} + \mathbf{s}_t \mathbf{y}_t^{\text{T}} B_{t-1}^{-1}}{\mathbf{s}_t^{\text{T}} \mathbf{y}_t} $$
$4.$ Determine step size $\gamma_t$, usually via a line search along the Quasi-Newton direction $B_t^{-1} \mathbf{g}(\mathbf{x}_t)$:
$$\gamma_t = \underset{\gamma}{\text{argmin}} \, f(\mathbf{x}_t - \gamma B_t^{-1} \mathbf{g}(\mathbf{x}_t))$$
$5.$ Update the parameters:
$$\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t - \gamma_t B_t^{-1} \mathbf{g}(\mathbf{x}_t)$$
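Here is the whole loop as a sketch in Python. The Armijo backtracking line search and the Rosenbrock test function are my own simplifications; serious implementations use a Wolfe-condition line search, which also guarantees $\mathbf{s}_t^{\text{T}}\mathbf{y}_t > 0$:

```python
import numpy as np

def backtracking_line_search(f, x, g, d, gamma=1.0, beta=0.5, c=1e-4):
    # Simple Armijo backtracking along the descent direction -d.
    for _ in range(50):  # cap the number of backtracking steps
        if f(x - gamma * d) <= f(x) - c * gamma * (g @ d):
            break
        gamma *= beta
    return gamma

def bfgs_inverse_update(B_inv, s, y):
    sy = s @ y  # curvature s^T y
    if sy <= 1e-12:
        return B_inv  # skip the update when the curvature condition fails
    By = B_inv @ y
    return (B_inv
            + (sy + y @ By) * np.outer(s, s) / sy**2
            - (np.outer(By, s) + np.outer(s, By)) / sy)

def minimize_bfgs(f, grad, x0, steps=100):
    x_prev, g_prev = x0, grad(x0)
    # First step: B_0 = I, i.e. a plain gradient descent step.
    gamma = backtracking_line_search(f, x_prev, g_prev, g_prev)
    x = x_prev - gamma * g_prev
    B_inv = np.eye(len(x0))
    for t in range(steps):
        g = grad(x)
        s, y = x - x_prev, g - g_prev               # steps 1 and 2
        if t == 0:
            # Scale the initial identity using the heuristic from above.
            B_inv = np.eye(len(x)) * (y @ s) / (y @ y)
        B_inv = bfgs_inverse_update(B_inv, s, y)    # step 3
        d = B_inv @ g
        gamma = backtracking_line_search(f, x, g, d)  # step 4
        x_prev, g_prev = x, g
        x = x - gamma * d                           # step 5
    return x

# Example on the Rosenbrock function; the minimizer is [1, 1].
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(minimize_bfgs(f, grad, np.zeros(2)))
```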
Further developments
I will now briefly outline some important modifications to Quasi-Newton (QN) methods. This is not an exhaustive list and I might update it in the future.
Limited-memory
Limited-memory variants are probably the most important and widely used modifications of QN methods. For a problem with $n$ variables, standard QN uses at least $n^2$ memory to store $B$ or $B^{-1}$, so it can’t be applied to large-scale problems.
Limited-memory variants store $B$ more efficiently and usually only consider a limited history of past $\mathbf{s}_t$ and $\mathbf{y}_t$ pairs.
Let’s say we’ve initialized $B$ to a scaled identity and performed $k$ updates with the BFGS formula, where each update is a rank-2 correction. After $k$ rank-2 updates, $B$ becomes a scaled identity plus a symmetric matrix of rank at most $2k$. This can be represented as $B = \alpha I + U W U^{\text{T}}$, where $U$ is an $n \times 2k$ matrix and $W$ is a small $2k \times 2k$ matrix. This is called the compact representation, and it only requires roughly $n \times 2k$ memory.
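To see why this matters, here is a sketch of a matrix-vector product in the compact form (with `alpha`, `U`, `W` as assumed inputs), which costs $O(nk)$ instead of $O(n^2)$:

```python
import numpy as np

def compact_matvec(alpha, U, W, v):
    # (alpha * I + U @ W @ U.T) @ v computed right-to-left in O(n k) time,
    # never forming the n x n matrix.
    return alpha * v + U @ (W @ (U.T @ v))
```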
Alternatively, it is possible to store the past $k$ pairs of $\mathbf{s}_t$ and $\mathbf{y}_t$ and compute $B^{-1}\mathbf{g}(\mathbf{x}_t)$ via the two-loop recursion, never explicitly forming $B^{-1}$; this also requires only $n \times 2k$ memory. This is used, for example, by the L-BFGS method, which is probably the most widely used QN method to date.
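Here is a sketch of the two-loop recursion (assuming at least one stored pair; real L-BFGS implementations add safeguards around $\mathbf{y}^{\text{T}}\mathbf{s}$):

```python
import numpy as np

def two_loop_recursion(g, s_history, y_history):
    """Compute B^{-1} @ g from past (s, y) pairs without forming B^{-1}.

    s_history, y_history: lists of the k most recent pairs, oldest first.
    """
    q = g.copy()
    alphas = []
    # First loop: walk the history from newest to oldest.
    for s, y in zip(reversed(s_history), reversed(y_history)):
        alpha = (s @ q) / (y @ s)
        q -= alpha * y
        alphas.append(alpha)
    # Apply the scaled-identity initial Hessian inverse (heuristic from above).
    s, y = s_history[-1], y_history[-1]
    q *= (s @ y) / (y @ y)
    # Second loop: walk the history from oldest to newest.
    for s, y, alpha in zip(s_history, y_history, reversed(alphas)):
        beta = (y @ q) / (y @ s)
        q += (alpha - beta) * s
    return q  # approximates B^{-1} @ g
```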
Diagonal QN
Diagonal QN methods store $B$ as a diagonal matrix, so they require only $n$ memory and need no matrix multiplications or linear solves.
Diagonal QN can’t use the secant equation. If we take the secant equation $B\mathbf{s}=\mathbf{y}$ and constrain $B$ to be diagonal, the only solution is $B = \operatorname{diag}(\mathbf{y}/\mathbf{s})$, i.e. the element-wise division of $\mathbf{y}$ by $\mathbf{s}$ placed on the diagonal. This is extremely unstable and basically doesn’t work.
Instead, diagonal QN methods use other conditions, such as the weak secant equation, weak quasi-Cauchy relations, etc.
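For reference, the weak secant equation asks only that the curvature along $\mathbf{s}_t$ matches - a single scalar condition instead of $n$ equations: $$ \mathbf{s}_t^{\text{T}} B_t \mathbf{s}_t = \mathbf{s}_t^{\text{T}} \mathbf{y}_t $$ A diagonal $B_t$ still has $n$ degrees of freedom, so the condition is underdetermined and the least-change principle becomes useful again.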
Current diagonal QN methods haven’t seen widespread adoption, probably because they converge more slowly than limited-memory methods, but they are also even cheaper, so they might have some use cases.
Randomized/Greedy QN
QN methods take $\mathbf{s}_t$ to be the difference between consecutive parameters so that we have easy access to the approximation $\mathbf{y}_t = \mathbf{g}(\mathbf{x}_t) - \mathbf{g}(\mathbf{x}_{t-1}) \approx H(\mathbf{x}_t)\mathbf{s}_t$, but other choices of $\mathbf{s}_t$ are possible. The main disadvantage of this approximation is that it depends on the difference between consecutive gradients, which is too noisy when the objective function is stochastic. A simple alternative is to sample a random (or some other) vector $\mathbf{s}$ and explicitly compute $\mathbf{y}$ as the Hessian-vector product with $\mathbf{s}$, either via the finite-difference formula or Pearlmutter’s trick. This is, for example, used in the Stochastic Quasi-Newton method (SQN).
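A sketch of that alternative (my own illustration; `grad` is an assumed gradient oracle): sample a random $\mathbf{s}$ and build $\mathbf{y}$ with both gradients evaluated around the same point, rather than along the trajectory:

```python
import numpy as np

def sampled_pair(grad, x, rng, r=1e-2):
    # Sample a random direction s and compute y ~= H(x) s by finite
    # differences. Both gradients are evaluated near the same point x, so on
    # a stochastic objective they can share one mini-batch, avoiding the
    # noise of differencing gradients from different iterates.
    s = rng.standard_normal(x.shape)
    y = (grad(x) - grad(x - r * s)) / r
    return s, y

rng = np.random.default_rng(0)
grad = lambda x: 4.0 * x**3  # toy gradient of f(x) = sum(x_i^4)
s, y = sampled_pair(grad, np.ones(3), rng)
print(s, y)
```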
Moreover, it is possible to choose $\mathbf{s}$ such that the Hessian-vector product with it gives us as much information about the Hessian as possible; doing so is called a greedy quasi-Newton update.
Notes
For some diabolical reason, in the Quasi-Newton literature the Hessian inverse approximation is often denoted as $H$. To avoid confusing it with the Hessian, I wrote it as $B^{-1}$. This is not perfect either, but it’s also used in some QN literature and I think it’s less confusing.