\[
H_+ = H_c + \frac{(y - H_c s)s^T + s(y - H_c s)^T}{s^T s}
          - \frac{[s^T (y - H_c s)]\, s s^T}{(s^T s)^2},
\tag{4.37}
\]

and the symmetric rank-one (SR1) [35] update,

\[
H_+ = H_c + \frac{(y - H_c s)(y - H_c s)^T}{(y - H_c s)^T s}.
\tag{4.38}
\]

By preserving the symmetry of the approximate Hessians, but not the positive definiteness, these updates present a problem for a line search globalization but an opportunity for a trust region approach. The SR1 update has been reported to outperform BFGS algorithms in certain cases [165], [41], [64], [65], [163], [258], [118], [250], [119], [268], [164], in which either the approximate Hessians can be expected to be positive definite or a trust region framework is used [41], [64], [65].

One may update the inverse of the SR1 approximate Hessian using the Sherman–Morrison formula, (4.39), a simple relation between the inverse of a nonsingular matrix and that of a rank-one update of that matrix [93], [239], [240], [14].

Proposition 4.3.1. Let $H$ be a nonsingular $N \times N$ matrix and let $u, v \in R^N$. Then $H + uv^T$ is invertible if and only if $1 + v^T H^{-1} u \ne 0$. In this case
\[
(H + uv^T)^{-1} = \left( I - \frac{(H^{-1} u) v^T}{1 + v^T H^{-1} u} \right) H^{-1}.
\tag{4.39}
\]

The proof is simply a direct verification of (4.39).
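Formula (4.39) is also easy to check numerically. The sketch below (Python with NumPy rather than the book's MATLAB, purely as an illustration) compares the rank-one corrected inverse against a directly computed inverse.

```python
import numpy as np

# Numerical check of the Sherman-Morrison formula (4.39):
# (H + u v^T)^{-1} = (I - (H^{-1} u) v^T / (1 + v^T H^{-1} u)) H^{-1}.
rng = np.random.default_rng(0)
N = 5
H = np.eye(N) + 0.1 * rng.standard_normal((N, N))  # well conditioned, hence nonsingular
u = rng.standard_normal(N)
v = rng.standard_normal(N)

Hinv = np.linalg.inv(H)
denom = 1.0 + v @ Hinv @ u          # must be nonzero (Proposition 4.3.1)

# Inverse of the rank-one update via (4.39); only O(N^2) work given H^{-1}
Hplus_inv = (np.eye(N) - np.outer(Hinv @ u, v) / denom) @ Hinv

print(np.allclose(Hplus_inv, np.linalg.inv(H + np.outer(u, v))))  # prints True
```

The point of (4.39) in this context is that, once $H^{-1}$ is available, inverting a rank-one update costs only matrix–vector work, which is what makes an inverse form of the SR1 update practical.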

The SR1 algorithm terminates in finitely many iterations for convex quadratic optimization problems [101]. Since the denominator $(y - H_c s)^T s$ could vanish, the update could completely fail, and implementations must examine the denominator and take appropriate action if it is too small. This update does not enforce or require positivity of the approximate Hessian and has been used effectively to exploit negative curvature in a trust region context [165], [41].

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
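A minimal sketch of the safeguarded SR1 update (4.38) might look as follows. This is Python/NumPy for illustration, not the book's code, and the relative threshold `r` is a typical choice rather than one prescribed in the text.

```python
import numpy as np

def sr1_update(Hc, s, y, r=1e-8):
    """One SR1 update (4.38) of the approximate Hessian, skipped when unsafe."""
    w = y - Hc @ s                      # residual of the secant equation
    denom = w @ s                       # the denominator (y - Hc s)^T s
    if abs(denom) <= r * np.linalg.norm(w) * np.linalg.norm(s):
        return Hc                       # denominator too small: skip the update
    return Hc + np.outer(w, w) / denom  # symmetric rank-one correction
```

When the update is taken, the new matrix satisfies the secant equation $H_+ s = y$ and remains symmetric, but, as noted above, it need not be positive definite.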

For overdetermined nonlinear least squares problems one can try to approximate the second-order term in $\nabla^2 f$ while computing $R'^T R'$ exactly. Suppose
\[
\nabla^2 f(x) \approx H = C(x) + A,
\]
where the idea is that $C$, the computed part, is significantly easier to compute than $A$, the approximated part. This is certainly the case for nonlinear least squares, where $C = R'^T R'$. A quasi-Newton method that intends to exploit this structure will update $A$ only; hence
\[
H_+ = C(x_+) + A_+.
\]

Superlinear convergence proofs require, in one way or another, that $H_+ s = y$. Therefore, in terms of $A$, one might require the update to satisfy
\[
A_+ s = y^\# = y - C(x_+) s.
\tag{4.40}
\]
The definition of $y^\#$ given in (4.40) is called the default choice in [87]. This is not the only choice for $y^\#$, and one can prove superlinear convergence for this and many other choices [87], [84]. This idea, using several different updates, has been used in other contexts, such as optimal control [159], [164].

An algorithm of this type, using SR1 to update $A$ and a different choice for $y^\#$, was suggested in [20] and [21]. The nonlinear least squares update from [77], [78], and [84] uses a DFP update and yet another $y^\#$ to compute $A_+$,
\[
A_+ = A_c + \frac{(y^\# - A_c s)\,{y^\#}^T + y^\# (y^\# - A_c s)^T}{{y^\#}^T s}
          - \frac{[(y^\# - A_c s)^T s]\, y^\# {y^\#}^T}{({y^\#}^T s)^2}.
\tag{4.41}
\]
The application of this idea to large-residual least squares problems is not trivial, and scaling issues must be considered in a successful implementation.

Our proof of superlinear convergence can be applied to updates like (4.41). We state a special case of a result from [87] for the BFGS formulation
\[
A_+ = A_c + \frac{y^\# {y^\#}^T}{{y^\#}^T s} - \frac{(A_c s)(A_c s)^T}{s^T A_c s}.
\tag{4.42}
\]

Theorem 4.3.2. Let the standard assumptions hold and assume that
\[
A_* = \nabla^2 f(x_*) - C(x_*)
\]
is spd. Then there is $\delta$ such that if
\[
\|x_0 - x_*\| \le \delta \quad \text{and} \quad \|A_0 - A_*\| \le \delta,
\]
then the quasi-Newton iterates defined by (4.42) exist and converge q-superlinearly to $x_*$.

This result can be readily proved using the methods in this chapter (see [159]).
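To make the structure concrete, the following sketch (Python/NumPy, with hypothetical names; in particular `C_plus` stands for the exactly computed part $C(x_+)$) carries out one update of the form (4.42) using the default $y^\#$ of (4.40). The curvature safeguards a serious implementation would need are omitted.

```python
import numpy as np

def structured_bfgs_step(Ac, s, y, C_plus):
    """One structured BFGS update (4.42); returns the new model Hessian H_+."""
    y_sharp = y - C_plus @ s            # default choice of y# from (4.40)
    ys = y_sharp @ s                    # assumed positive here (no safeguard)
    sAs = s @ Ac @ s                    # positive whenever Ac is spd
    A_plus = (Ac
              + np.outer(y_sharp, y_sharp) / ys
              - np.outer(Ac @ s, Ac @ s) / sAs)
    return C_plus + A_plus              # H_+ = C(x_+) + A_+
```

By construction $A_+ s = y^\#$, so the full model satisfies the secant equation $H_+ s = C(x_+)s + y^\# = y$, which is the property the superlinear convergence proofs need.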

Quasi-Newton methods can also be designed to take into account special structure, such as

the sparsity pattern of the Hessian. One can update only those elements that are nonzero in the

initial approximation to the Hessian, requiring that the secant equation Hs = y holds. Such

updates have been proposed and analyzed in varying levels of generality in [83], [87], [185],

[238], [256], and [255].


Another approach is to use the dependency of $f$ on subsets of the variables, a structure that is often present in discretizations of infinite-dimensional problems where coefficients of operators can be updated rather than entire matrix representations of those operators. We refer the reader to [133], [131], and [132] for an algebraic viewpoint based on finite-dimensional analysis and to [159], [157], [164], [160], and [163] for an operator theoretic description of these methods. When applied to discretizations of infinite-dimensional optimization problems, quasi-Newton methods perform best when they also work well on the infinite-dimensional problem itself. Work on BFGS in Hilbert space can be found, for example, in [135], [158], and [159].

Quasi-Newton methods have been designed for underdetermined problems [184], and Broyden's method itself has been applied to linear least squares problems [111], [148].

4.4 Examples

The computations in this section were done with the MATLAB code bfgswopt. For the small

parameter ID problem, where evaluation of f is far more expensive than the cost of maintaining

or factoring the (very small!) approximate Hessian, one could also use a brute force approach

in which H is updated and factored anew with each iteration.

4.4.1 Parameter ID Problem

We solve the parameter ID problem with the same data as in §3.4.1 using $H_0 = I$ as the initial Hessian. We compare the BFGS solution with the Gauss–Newton iteration from §3.4.1. From Figure 4.1 one can see the local superlinear convergence and the good performance of the line search. However, as one should expect, the Gauss–Newton iteration, being designed for small residual least squares problems, was more efficient. The Gauss–Newton iteration required 14 function evaluations, 6 gradients, and roughly 1.3 million floating point operations, while the BFGS–Armijo iteration needed 29 function evaluations, 15 gradients, and 3.8 million floating point operations.

4.4.2 Discrete Control Problem

We return to the example from §3.4.2. For our first example we use the initial iterate
\[
u_0(t) = 10.
\]
BFGS also requires an initial approximation to the Hessian and we consider two such approximations:
\[
H_p = .25 I \quad \text{and} \quad H_g = I.
\tag{4.43}
\]

The Hessian for the continuous problem is a compact perturbation of the identity, and the theory from [158] and [135] indicates that $H_g$ is a much better approximate Hessian than $H_p$. The results in Figure 4.2 support that idea. For the better Hessian, one can see the concavity of superlinear convergence in the plot of the gradient norm. The computation for the better Hessian required 12 iterations and roughly 572 thousand floating point operations, while the one with the poor Hessian took 16 iterations and roughly 880 thousand floating point operations. Stepsize reductions were not required for the good Hessian and were needed four times during the iteration for the poor Hessian. However, the guard against poor scaling (3.50) was needed in both cases.

When we used the same poor initial iterate that we used in §3.4.2,
\[
u_0(t) = 5 + 300 \sin(20\pi t),
\]


[Figure 4.1: BFGS–Armijo and Gauss–Newton for the Parameter ID Problem. Four log-scale panels plot function value and gradient norm against iterations for the Gauss–Newton (top) and BFGS (bottom) iterations.]

[Figure 4.2: BFGS–Armijo for Discrete Control Problem. Four log-scale panels plot function value and gradient norm against iterations for the better Hessian (top) and the poor Hessian (bottom).]


and allocated 50 vectors to Algorithm bfgsopt, there was no longer a benefit to using the good Hessian. In fact, as is clear from Figure 4.3, the poor Hessian produced a more rapidly convergent iteration.

[Figure 4.3: BFGS–Armijo for Discrete Control Problem: Poor Initial Iterate. Four log-scale panels plot function value and gradient norm against iterations for the better Hessian (top) and the poor Hessian (bottom).]

4.5 Exercises on BFGS

4.5.1. Use the secant method (4.2) with initial data $x_{-1} = 1$ and $x_0 = .9$ to minimize $f(x) = x^4$. Explain the convergence of the iteration.

4.5.2. Prove Lemma 4.1.1. It might help to use the secant equation.

4.5.3. Prove Lemma 4.1.4.

4.5.4. Verify (4.13) and compute the constant C.

4.5.5. Prove (4.25).

4.5.6. As an exercise in character building, implement Algorithm bfgsrec nonrecursively.

4.5.7. Show how the Sherman–Morrison formula can be used to implement the SR1 update in such a way that only one vector need be stored for each iterate.

4.5.8. State and prove a local convergence theorem for DFP and/or PSB.

4.5.9. Implement the DFP and PSB update and compare their performance with BFGS on the

examples from §4.4.


4.5.10. Show that, for positive definite quadratic problems, the BFGS method with an exact line search (i.e., one that finds the minimum of $f$ in the search direction) is the same as CG [201], [200].

4.5.11. Prove Theorem 4.3.2.


Chapter 5

Simple Bound Constraints