x_{n+1} = x_n − H_n^{-1} ∇f(x_n)

for some sequence of nonsingular matrices H_n. Then if x_n → x* q-linearly, x_n ≠ x* for any n, and (4.19) holds, then x_n → x* q-superlinearly.

Proof. We begin by invoking (4.10) to obtain

E_n s_n = (H_n^{-1} − I) s_n = (H_n^{-1} − I)(y_n + O(‖s_n‖²)) = E_n y_n + O(‖s_n‖²).

Convergence of x_n to x* implies that s_n → 0 and hence (4.19) can be written as

(4.21)    lim_{n→∞} ‖E_n y_n‖/‖s_n‖ = 0,

where y_n = ∇f(x_{n+1}) − ∇f(x_n).

Now let σ be the q-factor for the sequence {x_n}. Clearly

(1 − σ)‖e_n‖ ≤ ‖s_n‖ ≤ (1 + σ)‖e_n‖.

Hence (4.21) is equivalent to

(4.22)    lim_{n→∞} ‖E_n y_n‖/‖e_n‖ = 0.

Since H_n^{-1} ∇f(x_n) = −s_n and s_n = y_n + O(‖s_n‖²) we have

E_n y_n = (H_n^{-1} − I)(∇f(x_{n+1}) − ∇f(x_n))
        = H_n^{-1} ∇f(x_{n+1}) + s_n − y_n = H_n^{-1} ∇f(x_{n+1}) + O(‖s_n‖²)
        = H_n^{-1} e_{n+1} + O(‖e_{n+1}‖² + ‖s_n‖²) = H_n^{-1} e_{n+1} + O(‖e_n‖²).

Therefore, (4.22) implies that

M^{-1} ‖e_{n+1}‖/‖e_n‖ ≤ ‖H_n^{-1} e_{n+1}‖/‖e_n‖ ≤ ‖E_n y_n‖/‖e_n‖ + O(‖e_n‖) → 0

as n → ∞, proving q-superlinear convergence.
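As a toy illustration of the theorem (not from the text): take f(x) = ‖x‖²/2, so ∇f(x) = x and ∇²f(x*) = I at the minimizer x* = 0, and let H_n^{-1} = (1 + 1/(n+2))I, so that E_n → 0 and (4.19) holds trivially. The error ratios ‖e_{n+1}‖/‖e_n‖ then collapse to zero:

```python
import numpy as np

# f(x) = ||x||^2 / 2, grad f(x) = x, minimizer x* = 0.
# H_n^{-1} = (1 + 1/(n+2)) I, so E_n -> 0 and (4.19) holds.
x = np.array([1.0, -2.0, 0.5])
ratios = []
for n in range(8):
    x_new = x - (1.0 + 1.0 / (n + 2)) * x   # x_{n+1} = x_n - H_n^{-1} grad f(x_n)
    ratios.append(np.linalg.norm(x_new) / np.linalg.norm(x))
    x = x_new
# ratios[n] = 1/(n+2): 1/2, 1/3, 1/4, ... -> 0, a q-superlinear decay
```

Here the ratio at step n is exactly 1/(n + 2), so the convergence is q-superlinear even though no H_n equals the true Hessian.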

For the remainder of this section we assume that (4.15) holds and that δ is small enough so that the conclusions of Theorem 4.1.7 hold for some σ ∈ (0, 1). An immediate consequence of this is that

(4.23)    Σ_{n=0}^{∞} ‖s_n‖ < ∞,

since ‖s_n‖ ≤ (1 + σ)‖e_n‖ ≤ (1 + σ)σ^n ‖e_0‖.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

BFGS METHOD 77

The Frobenius norm of a matrix A is given by

(4.24)    ‖A‖_F² = Σ_{i,j=1}^{N} (A)_{ij}².

It is easy to show that (see exercise 4.5.5) for any unit vector v ∈ RN ,

(4.25)    ‖A(I − vv^T)‖_F² ≤ ‖A‖_F² − ‖Av‖²  and  ‖(I − vv^T)A‖_F² ≤ ‖A‖_F².
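A quick numerical check of (4.25) (the random A and unit vector v are arbitrary test data; the first bound in fact holds with equality, because the splitting A = Avv^T + A(I − vv^T) is orthogonal in the Frobenius inner product):

```python
import numpy as np

# Numerical spot check of (4.25) with arbitrary test data.
np.random.seed(0)
N = 6
A = np.random.randn(N, N)
v = np.random.randn(N)
v /= np.linalg.norm(v)                 # unit vector
P = np.eye(N) - np.outer(v, v)         # I - v v^T
fro2 = lambda M: np.linalg.norm(M, 'fro')**2
# ||A(I - v v^T)||_F^2 <= ||A||_F^2 - ||Av||^2 (here with equality)
assert np.isclose(fro2(A @ P), fro2(A) - np.linalg.norm(A @ v)**2)
# ||(I - v v^T)A||_F^2 <= ||A||_F^2
assert fro2(P @ A) <= fro2(A) + 1e-12
```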

We have, using (4.6), (4.7), and (4.25), that

(4.26)    ‖E_{n+1}‖_F² ≤ ‖E_n‖_F² − ‖E_n w_n‖² + O(‖s_n‖) = (1 − θ_n²)‖E_n‖_F² + O(‖s_n‖),

where w_n = s_n/‖s_n‖ and

θ_n = ‖E_n w_n‖/‖E_n‖_F if E_n ≠ 0, and θ_n = 1 if E_n = 0.

Using (4.23) we see that for any k ≥ 0,

Σ_{n=0}^{k} θ_n² ‖E_n‖_F² ≤ Σ_{n=0}^{k} (‖E_n‖_F² − ‖E_{n+1}‖_F²) + O(1)
                          = ‖E_0‖_F² − ‖E_{k+1}‖_F² + O(1) < ∞.

Hence θ_n ‖E_n‖_F → 0.

However, θ_n ‖E_n‖_F = ‖E_n w_n‖ if E_n ≠ 0 and θ_n ‖E_n‖_F = 0 if E_n = 0, so in either case

θ_n ‖E_n‖_F = ‖E_n w_n‖ = ‖E_n s_n‖/‖s_n‖.

Hence (4.19) holds. This completes the proof of Theorem 4.1.3.

4.1.2 Global Theory

If one uses the BFGS model Hessian in the context of Algorithm optarm, then Theorem 3.2.4 can be applied if the matrices {H_k} remain bounded and well conditioned. However, even if a limit point of the iteration is a minimizer x* that satisfies the standard assumptions, Theorem 3.2.4 does not guarantee that the iteration will converge to that point. The situation in which x is near x* but H is not near ∇²f(x*) is, from the point of view of the local theory, no better than that when x is far from x*. In practice, however, convergence (often superlinear) is observed. The result in this section is a partial explanation of this.

Our description of the global theory, using the Armijo line search paradigm from Chapter 3, is

based on [43]. We also refer the reader to [221], [45], and [269] for older results with a different

line search approach. Results of this type require strong assumptions on f and the initial iterate

x0 , but the reward is global and locally superlinear convergence for a BFGS–Armijo iteration.

Assumption 4.1.1. The set

D = {x | f(x) ≤ f(x_0)}


78 ITERATIVE METHODS FOR OPTIMIZATION

is convex and f is Lipschitz twice continuously differentiable in D. Moreover, there are λ_+ ≥ λ_− > 0 such that

σ(∇²f(x)) ⊂ [λ_−, λ_+]

for all x ∈ D.

Assumption 4.1.1 implies that f has a unique minimizer x* in D and that the standard assumptions hold near x*.

Theorem 4.1.9. Let Assumption 4.1.1 hold and let H_0 be spd. Then the BFGS–Armijo iteration converges q-superlinearly to x*.

The results for local and global convergence do not completely mesh. An implementation must allow for the fact that Assumption 4.1.1 may fail to hold, even near the root, and that y^T s ≤ 0 is a possibility when far from the root.

4.2 Implementation

The two implementation issues that we must confront are storage of the data needed to maintain the updates and a strategy for dealing with the possibility that y^T s ≤ 0. We address the storage question in §4.2.1. For the second issue, when y^T s is not sufficiently positive, we restart the BFGS update with the identity. We present the details of this in §4.2.2. Our globalization approach, also given in §4.2.2, is the simplest possible, the Armijo rule as described in Chapter 3.

We choose to discuss the Armijo rule in the interest of simplicity of exposition. However, while the Armijo rule is robust and sufficient for most problems, more complex line search schemes have been reported to be more efficient [42], and one who seeks to write a general purpose optimization code should give careful thought to the best way to globalize a quasi-Newton method. In the case of BFGS, for example, one is always seeking to use a positive definite quadratic model, even in regions of negative curvature, and in such regions the approximate Hessian could be reinitialized to the identity more often than necessary.

4.2.1 Storage

For the present we assume that y^T s > 0. We will develop a storage-efficient way to compute the BFGS step using the history of the iteration rather than full matrix storage.

The implementation recommended here is one of many that stores the history of the iteration and uses that information recursively to compute the action of H_k^{-1} on a vector. This idea was suggested in [16], [186], [206], and other implementations may be found in [44] and [201]. All of these implementations store the iteration history in the pairs {s_k, y_k} and we present a concrete example in Algorithm bfgsrec. A better, but somewhat less direct, way is based on the ideas in [91] and [275] and requires that only a single vector be stored for each iteration. We assume that we can compute the action of H_0^{-1} on a vector efficiently, say, by factoring H_0 at the outset of the iteration or by setting H_0 = I. We will use the BFGS formula from Lemma 4.1.1.

One way to maintain the update is to store the history of the iteration in the sequences of vectors {y_k} and {s_k} where

s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k).

If one has done this for k = 0, . . . , n − 1, one can compute the new search direction

d_n = −H_n^{-1} ∇f(x_n)

by a recursive algorithm which applies (4.5).


’1

Algorithm bfgsrec overwrites a given vector d with Hn d. The storage needed is one

vector for d and 2n vectors for the sequences {sk , yk }n’1 . A method for computing the product

k=0

’1

of H0 and a vector must also be provided.

Algorithm 4.2.1. bfgsrec(n, {s_k}, {y_k}, H_0^{-1}, d)

1. If n = 0, d = H_0^{-1} d; return

2. α = s_{n−1}^T d / y_{n−1}^T s_{n−1}; d = d − α y_{n−1}

3. call bfgsrec(n − 1, {s_k}, {y_k}, H_0^{-1}, d)

4. d = d + (α − (y_{n−1}^T d / y_{n−1}^T s_{n−1})) s_{n−1}
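For concreteness, here is a sketch of Algorithm bfgsrec in Python with NumPy rather than the MATLAB of the codes that accompany this book; the function name and the h0inv argument (a routine applying H_0^{-1} to a vector) are our choices:

```python
import numpy as np

def bfgsrec(n, s, y, h0inv, d):
    # Return H_n^{-1} d, given the stored pairs s[k] = x_{k+1} - x_k and
    # y[k] = grad f(x_{k+1}) - grad f(x_k), and a routine h0inv that
    # applies H_0^{-1} to a vector.
    if n == 0:
        return h0inv(d)
    alpha = (s[n-1] @ d) / (y[n-1] @ s[n-1])
    d = d - alpha * y[n-1]
    d = bfgsrec(n - 1, s, y, h0inv, d)
    return d + (alpha - (y[n-1] @ d) / (y[n-1] @ s[n-1])) * s[n-1]
```

The recursion applies the inverse update (4.5) one pair at a time: the pre-multiplication by (I − ρ_{n−1} y_{n−1} s_{n−1}^T) before the recursive call, and the symmetric factor plus the rank-one term after it.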

Algorithm bfgsrec has the great advantage, at least in a language that efficiently supports recursion, of being very simple. More complex, but nonrecursive, versions have been described in [16], [201], and [44].

The storage cost of two vectors per iteration can be significant, and when available storage is exhausted one can simply discard the iteration history and restart with H_0. This approach, which we implement in the remaining algorithms in this section, takes advantage of the fact that if H_0 is spd then −H_0^{-1} ∇f(x) will be a descent direction, and hence useful for a line search. Another approach, called the limited memory BFGS [44], [207], [176], [201], keeps all but the oldest (s, y) pair and continues with the update. Neither of these approaches for control of storage, while essential in practice for large problems, has the superlinear convergence properties that the full-storage algorithm does.

At a cost of a modest amount of complexity in the formulation, we can reduce the storage cost to one vector for each iteration. The method for doing this in [275] begins with an expansion of (4.5) as

H_+^{-1} = H_c^{-1} + α_0 s_c s_c^T + β_0 ((H_c^{-1} y_c) s_c^T + s_c (H_c^{-1} y_c)^T),

where

α_0 = (y_c^T s_c + y_c^T H_c^{-1} y_c)/(y_c^T s_c)²  and  β_0 = −1/(y_c^T s_c).

Now note that

H_c^{-1} y_c = H_c^{-1} ∇f(x_+) − H_c^{-1} ∇f(x_c) = H_c^{-1} ∇f(x_+) + s_c/λ_c

and obtain

(4.27)    H_+^{-1} = H_c^{-1} + α_1 s_c s_c^T + β_0 (s_c (H_c^{-1} ∇f(x_+))^T + (H_c^{-1} ∇f(x_+)) s_c^T),

where

α_1 = α_0 + 2β_0/λ_c.

Also

d_+ = −H_+^{-1} ∇f(x_+)

(4.28)    = −(I − (s_c y_c^T)/(y_c^T s_c)) H_c^{-1} (I − (y_c s_c^T)/(y_c^T s_c)) ∇f(x_+) − (s_c s_c^T ∇f(x_+))/(y_c^T s_c)

          = A_c s_c + B_c H_c^{-1} ∇f(x_+),

where

(4.29)    A_c = ( (∇f(x_+)^T H_c^{-1} ∇f(x_+))/(y_c^T s_c) + (s_c^T ∇f(x_+))/(λ_c y_c^T s_c) ) (1 − (s_c^T ∇f(x_+))/(y_c^T s_c)) − (s_c^T ∇f(x_+))/(y_c^T s_c)


and

(4.30)    B_c = −1 + (s_c^T ∇f(x_+))/(y_c^T s_c).
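As a sanity check on (4.28)–(4.30), one can verify the identity d_+ = A_c s_c + B_c H_c^{-1} ∇f(x_+) numerically with synthetic one-step data (the variable names and the symmetric test matrix are our choices; lam_c plays the role of λ_c):

```python
import numpy as np

# Verify d_+ = A_c s_c + B_c H_c^{-1} grad f(x_+) against the inverse
# BFGS update (4.5) applied explicitly. All data are synthetic.
np.random.seed(1)
N = 5
Q = np.random.randn(N, N)
Hc_inv = np.eye(N) + 0.1 * (Q + Q.T)      # symmetric model inverse Hessian
lam_c = 0.7                                # step length lambda_c
gc = np.random.randn(N)                    # grad f(x_c)
sc = -lam_c * (Hc_inv @ gc)                # s_c = lambda_c d_c
yc = sc + 0.1 * np.random.randn(N)         # keeps yc @ sc > 0
gp = gc + yc                               # grad f(x_+)
rho = 1.0 / (yc @ sc)
V = np.eye(N) - rho * np.outer(yc, sc)
Hp_inv = V.T @ Hc_inv @ V + rho * np.outer(sc, sc)   # (4.5)
d_plus = -Hp_inv @ gp
h = Hc_inv @ gp
Bc = -1.0 + rho * (sc @ gp)                               # (4.30)
Ac = (((gp @ h) * rho + (sc @ gp) * rho / lam_c)
      * (1.0 - rho * (sc @ gp)) - rho * (sc @ gp))        # (4.29)
assert np.allclose(d_plus, Ac * sc + Bc * h)
```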

At this point we can compute d_+, and therefore λ_+ and s_+, using only H_c^{-1} ∇f(x_+). We do not need H_+ at all. We can now form H_+^{-1} with the new data for the next iterate and will show that we do not need to store the vectors {y_k}.

Since (verify this!) B_c ≠ 0, we have

H_c^{-1} ∇f(x_+) = s_+/(λ_+ B_c) − (A_c/B_c) s_c.

Combining this with (4.27) gives

(4.31)    H_+^{-1} = H_c^{-1} + α_c s_c s_c^T + β_c (s_c s_+^T + s_+ s_c^T),

where

(4.32)    α_c = α_1 − 2β_0 A_c/B_c  and  β_c = β_0/(B_c λ_+).

This leads to the expansion

(4.33)    H_{n+1}^{-1} = H_0^{-1} + Σ_{k=0}^{n} ( α_k s_k s_k^T + β_k (s_k s_{k+1}^T + s_{k+1} s_k^T) ).
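The coefficient formulas (4.31)–(4.32) can be spot-checked numerically against the inverse update (4.5), using synthetic one-step data (names are ours; lam_c and lam_p stand for λ_c and λ_+):

```python
import numpy as np

# Check that (4.31)-(4.32) reproduce the inverse BFGS update (4.5).
np.random.seed(2)
N = 5
Q = np.random.randn(N, N)
Hc_inv = np.eye(N) + 0.1 * (Q + Q.T)       # symmetric inverse model Hessian
lam_c, lam_p = 0.7, 0.4                    # lambda_c, lambda_+
gc = np.random.randn(N)
sc = -lam_c * (Hc_inv @ gc)
yc = sc + 0.1 * np.random.randn(N)         # keeps yc @ sc > 0
gp = gc + yc
rho = 1.0 / (yc @ sc)
V = np.eye(N) - rho * np.outer(yc, sc)
Hp_inv = V.T @ Hc_inv @ V + rho * np.outer(sc, sc)
h = Hc_inv @ gp
Bc = -1.0 + rho * (sc @ gp)                # (4.30)
Ac = rho * (yc @ h) * (1.0 - rho * (sc @ gp)) - rho * (sc @ gp)
s_plus = lam_p * (-Hp_inv @ gp)            # s_+ = lambda_+ d_+
alpha0 = (yc @ sc + yc @ (Hc_inv @ yc)) / (yc @ sc)**2
beta0 = -rho
alpha1 = alpha0 + 2.0 * beta0 / lam_c
alpha_c = alpha1 - 2.0 * beta0 * Ac / Bc   # (4.32)
beta_c = beta0 / (Bc * lam_p)
H_form = (Hc_inv + alpha_c * np.outer(sc, sc)
          + beta_c * (np.outer(sc, s_plus) + np.outer(s_plus, sc)))
assert np.allclose(H_form, Hp_inv)         # (4.31)
```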

Upon reflection the reader will see that this is a complete algorithm. We can use (4.28) and H_n^{-1} to compute d_{n+1}. Then we can compute λ_{n+1} and s_{n+1} and use them and (4.32) to compute α_n and β_n. This new data can be used to form H_{n+1}^{-1} with (4.33), which we can use to compute d_{n+2} and continue the iteration.

In this way only the steps {s_k} and the expansion coefficients {α_k} and {β_k} need be stored. Algorithm bfgsopt is an implementation of these ideas.

Algorithm 4.2.2. bfgsopt(x, f, ε)

1. g = ∇f(x), n = 0.

2. While ‖g‖ > ε

   (a) If n = 0, d_n = −H_0^{-1} g; otherwise compute A, B, and d_n using (4.28), (4.29), and (4.30).

   (b) Compute λ_n, s_n, and x = x_{n+1} with the Armijo rule.

   (c) If n > 0 compute α_{n−1} and β_{n−1} using (4.32).

   (d) g = ∇f(x), n = n + 1.

4.2.2 A BFGS–Armijo Algorithm

In this section we present a simple implementation that shows how the theoretical results can be applied in algorithm design. Let H_+^{BFGS} be the BFGS update from H_c and define the two modified BFGS (MBFGS) updates by

(4.34)    H_+ = H_+^{BFGS} if y^T s > 0,  H_+ = I if y^T s ≤ 0,


and

(4.35)    H_+ = H_+^{BFGS} if y^T s > 0,  H_+ = H_c if y^T s ≤ 0.

In the MBFGS1 method, (4.34), the model Hessian is reinitialized to I if y^T s ≤ 0. In the early phase of this iteration, where ∇²f may have negative eigenvalues, y^T s ≤ 0 is certainly possible and the search direction could be the steepest descent direction for several iterations.

An MBFGS2 step (4.35) keeps the history of the iteration even if y^T s ≤ 0. One view is that this approach keeps as much information as possible. Another is that once y^T s ≤ 0, the iteration history is suspect and should be thrown away. Both forms are used in practice. Our MATLAB code bfgswopt uses MBFGS1 and maintains an approximation to H^{-1} using Algorithm bfgsopt. We also guard against poor scaling by using (3.50).
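A minimal sketch of the MBFGS1 safeguard applied to the inverse approximation, in Python with NumPy rather than the book's MATLAB (the function name is ours):

```python
import numpy as np

def mbfgs1_inv_update(Hinv, s, y):
    # (4.34) applied to the inverse approximation: restart with I when
    # y^T s <= 0; otherwise apply the inverse BFGS update (4.5).
    if y @ s <= 0.0:
        return np.eye(len(s))
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Hinv @ V + rho * np.outer(s, s)
```

Restarting H with the identity is the same as restarting H^{-1} with the identity, which is why the guard can be applied directly to the inverse form maintained by the iteration.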

4.3 Other Quasi-Newton Methods

The DFP (Davidon, Fletcher, Powell) update [71], [72], [105]

(4.36)    H_+ = H_c + ((y − H_c s) y^T + y (y − H_c s)^T)/(y^T s) − ([(y − H_c s)^T s] y y^T)/(y^T s)²

has similar local convergence properties to BFGS but does not perform as well in practice [224],

[225].
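One can check numerically, with synthetic data, that (4.36) satisfies the secant equation H_+ s = y and agrees with the equivalent product form (I − ρ y s^T) H_c (I − ρ s y^T) + ρ y y^T, where ρ = 1/(y^T s); a sketch:

```python
import numpy as np

# Check the DFP update (4.36): secant equation H_+ s = y, and agreement
# with the product form (I - rho y s^T) H_c (I - rho s y^T) + rho y y^T.
np.random.seed(4)
N = 5
Q = np.random.randn(N, N)
Hc = np.eye(N) + 0.1 * (Q + Q.T)           # symmetric H_c
s = np.random.randn(N)
y = s + 0.1 * np.random.randn(N)           # keeps y^T s > 0
r = y - Hc @ s
ys = y @ s
Hp = (Hc + (np.outer(r, y) + np.outer(y, r)) / ys
      - (r @ s) * np.outer(y, y) / ys**2)  # (4.36)
rho = 1.0 / ys
W = np.eye(N) - rho * np.outer(s, y)
assert np.allclose(Hp @ s, y)              # secant equation
assert np.allclose(Hp, W.T @ Hc @ W + rho * np.outer(y, y))
```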

Two updates that preserve symmetry, but not definiteness, are the PSB (Powell symmetric Broyden) update [219],