
xn+1 = xn − Hn^{-1} ∇f(xn)

for some sequence of nonsingular matrices Hn. Then if xn → x* q-linearly, xn ≠ x* for any
n, and (4.19) holds, then xn → x* q-superlinearly.
Proof. We begin by invoking (4.10) to obtain

En sn = (Hn^{-1} − I) sn = (Hn^{-1} − I)(yn + O(‖sn‖^2)) = En yn + O(‖sn‖^2).


Convergence of xn to x* implies that sn → 0 and hence (4.19) can be written as

(4.21)    lim_{n→∞} ‖En yn‖/‖sn‖ = 0,

where yn = ∇f(xn+1) − ∇f(xn).
Now let σ be the q-factor for the sequence {xn }. Clearly

(1 − σ)‖en‖ ≤ ‖sn‖ ≤ (1 + σ)‖en‖.

Hence (4.21) is equivalent to
(4.22)    lim_{n→∞} ‖En yn‖/‖en‖ = 0.

Since Hn^{-1} ∇f(xn) = −sn and sn = yn + O(‖sn‖^2) we have

En yn = (Hn^{-1} − I)(∇f(xn+1) − ∇f(xn))
      = Hn^{-1} ∇f(xn+1) + sn − yn = Hn^{-1} ∇f(xn+1) + O(‖sn‖^2)
      = Hn^{-1} en+1 + O(‖en+1‖^2 + ‖sn‖^2) = Hn^{-1} en+1 + O(‖en‖^2).
Therefore, (4.22) implies that

‖En yn‖/‖en‖ = ‖Hn^{-1} en+1‖/‖en‖ + O(‖en‖) ≥ M^{-1} ‖en+1‖/‖en‖ + O(‖en‖),

and hence ‖en+1‖/‖en‖ → 0 as n → ∞, proving q-superlinear convergence.
For the remainder of this section we assume that (4.15) holds and that δ is small enough so
that the conclusions of Theorem 4.1.7 hold for some σ ∈ (0, 1). An immediate consequence of
this is that

(4.23)    Σ_{n=0}^{∞} ‖sn‖ < ∞.




The Frobenius norm of a matrix A is given by
(4.24)    ‖A‖_F^2 = Σ_{i,j=1}^{N} (A)_{ij}^2.


It is easy to show that (see exercise 4.5.5) for any unit vector v ∈ R^N,

(4.25)    ‖A(I − v v^T)‖_F^2 and ‖(I − v v^T)A‖_F^2 ≤ ‖A‖_F^2 − ‖Av‖^2 ≤ ‖A‖_F^2.

We have, using (4.6), (4.7), and (4.25), that
(4.26)    ‖En+1‖_F^2 ≤ ‖En‖_F^2 − ‖En wn‖^2 + O(‖sn‖) = (1 − θn^2) ‖En‖_F^2 + O(‖sn‖),

where wn = sn/‖sn‖ and

θn = ‖En wn‖/‖En‖_F   if En ≠ 0,
θn = 1                if En = 0.

Using (4.23) we see that for any k ≥ 0,
Σ_{n=0}^{k} θn^2 ‖En‖_F^2 ≤ Σ_{n=0}^{k} (‖En‖_F^2 − ‖En+1‖_F^2) + O(1)
                          = ‖E0‖_F^2 − ‖Ek+1‖_F^2 + O(1) < ∞.

Hence θn ‖En‖_F → 0.
However,

θn ‖En‖_F = ‖En wn‖   if En ≠ 0,
θn ‖En‖_F = 0         if En = 0,

so in either case

θn ‖En‖_F = ‖En wn‖ = ‖En sn‖/‖sn‖.

Hence (4.19) holds. This completes the proof of Theorem 4.1.3.

4.1.2 Global Theory
If one uses the BFGS model Hessian in the context of Algorithm optarm, then Theorem 3.2.4
can be applied if the matrices {Hk } remain bounded and well conditioned. However, even if a
limit point of the iteration is a minimizer x* that satisfies the standard assumptions, Theorem 3.2.4
does not guarantee that the iteration will converge to that point. The situation in which x is near
x* but H is not near ∇²f(x*) is, from the point of view of the local theory, no better than that
when x is far from x*. In practice, however, convergence (often superlinear) is observed. The
result in this section is a partial explanation of this.
Our description of the global theory, using the Armijo line search paradigm from Chapter 3, is
based on [43]. We also refer the reader to [221], [45], and [269] for older results with a different
line search approach. Results of this type require strong assumptions on f and the initial iterate
x0, but the reward is global and locally superlinear convergence for a BFGS–Armijo iteration.
Assumption 4.1.1. The set

D = {x | f(x) ≤ f(x0)}



is convex and f is Lipschitz twice continuously differentiable in D. Moreover, there are λ+ ≥
λ− > 0 such that

σ(∇²f(x)) ⊂ [λ−, λ+]
for all x ∈ D.
Assumption 4.1.1 implies that f has a unique minimizer x* in D and that the standard
assumptions hold near x*.
Theorem 4.1.9. Let Assumption 4.1.1 hold and let H0 be spd. Then the BFGS–Armijo
iteration converges q-superlinearly to x*.
The results for local and global convergence do not completely mesh. An implementation
must allow for the fact that Assumption 4.1.1 may fail to hold, even near the root, and that
y^T s ≤ 0 is a possibility when far from the root.


4.2 Implementation
The two implementation issues that we must confront are storage of the data needed to maintain
the updates and a strategy for dealing with the possibility that y^T s ≤ 0. We address the
storage question in §4.2.1. For the second issue, when y^T s is not sufficiently positive, we restart
the BFGS update with the identity. We present the details of this in §4.2.2. Our globalization
approach, also given in §4.2.2, is the simplest possible, the Armijo rule as described in Chapter 3.
We choose to discuss the Armijo rule in the interest of simplicity of exposition. However,
while the Armijo rule is robust and sufficient for most problems, more complex line search
schemes have been reported to be more efficient [42], and one who seeks to write a general
purpose optimization code should give careful thought to the best way to globalize a quasi-
Newton method. In the case of BFGS, for example, one is always seeking to use a positive definite
quadratic model, even in regions of negative curvature, and in such regions the approximate
Hessian could be reinitialized to the identity more often than necessary.

4.2.1 Storage
For the present we assume that y^T s > 0. We will develop a storage-efficient way to compute
the BFGS step using the history of the iteration rather than full matrix storage.
The implementation recommended here is one of many that stores the history of the iteration
and uses that information recursively to compute the action of Hk^{-1} on a vector. This idea was
suggested in [16], [186], [206], and other implementations may be found in [44] and [201].
All of these implementations store the iteration history in the pairs {sk , yk } and we present a
concrete example in Algorithm bfgsrec. A better, but somewhat less direct, way is based on
the ideas in [91] and [275] and requires that only a single vector be stored for each iteration. We
assume that we can compute the action of H0^{-1} on a vector efficiently, say, by factoring H0 at the
outset of the iteration or by setting H0 = I. We will use the BFGS formula from Lemma 4.1.1.
One way to maintain the update is to store the history of the iteration in the sequences of
vectors {yk } and {sk } where

sk = xk+1 − xk   and   yk = ∇f(xk+1) − ∇f(xk).

If one has done this for k = 0, . . . , n − 1, one can compute the new search direction

dn = −Hn^{-1} ∇f(xn)

by a recursive algorithm which applies (4.5).



Algorithm bfgsrec overwrites a given vector d with Hn^{-1} d. The storage needed is one
vector for d and 2n vectors for the sequences {sk, yk}_{k=0}^{n−1}. A method for computing the product
of H0^{-1} and a vector must also be provided.

Algorithm 4.2.1. bfgsrec(n, {sk}, {yk}, H0^{-1}, d)

1. If n = 0, d = H0^{-1} d; return
2. α = sn−1^T d / yn−1^T sn−1; d = d − α yn−1
3. call bfgsrec(n − 1, {sk}, {yk}, H0^{-1}, d)
4. d = d + (α − (yn−1^T d / yn−1^T sn−1)) sn−1

Algorithm bfgsrec has the great advantage, at least in a language that efficiently supports
recursion, of being very simple. More complex, but nonrecursive, versions have been described
in [16], [201], and [44].
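For concreteness, here is a minimal MATLAB sketch of Algorithm bfgsrec under assumed data layouts: the columns of S and Y hold s0, ..., sn−1 and y0, ..., yn−1, and H0inv is a function handle that applies H0^{-1} to a vector. The names and storage conventions are ours, not those of the codes in [16], [201], or [44].

function d = bfgsrec(n, S, Y, H0inv, d)
% Overwrite d with H_n^{-1} d using the stored pairs {s_k, y_k}.
if n == 0
    d = H0inv(d);
else
    s = S(:, n); y = Y(:, n);                  % s_{n-1}, y_{n-1} in 1-based storage
    alpha = (s' * d) / (y' * s);               % step 2
    d = d - alpha * y;
    d = bfgsrec(n - 1, S, Y, H0inv, d);        % step 3
    d = d + (alpha - (y' * d) / (y' * s)) * s; % step 4
end
end

With H0 = I one can pass @(v) v for H0inv; a call such as d = bfgsrec(n, S, Y, @(v) v, -g), with g = ∇f(xn), then returns the search direction dn = −Hn^{-1} ∇f(xn).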
The storage cost of two vectors per iteration can be significant, and when available storage is
exhausted one can simply discard the iteration history and restart with H0. This approach, which
we implement in the remaining algorithms in this section, takes advantage of the fact that if H0
is spd then −H0^{-1} ∇f(x) will be a descent direction, and hence useful for a line search. Another
approach, called the limited memory BFGS [44], [207], [176], [201], keeps all but the oldest
(s, y) pair and continues with the update. Neither of these approaches for control of storage,
while essential in practice for large problems, has the superlinear convergence properties that
the full-storage algorithm does.
At a cost of a modest amount of complexity in the formulation, we can reduce the storage
cost to one vector for each iteration. The method for doing this in [275] begins with an expansion
of (4.5) as
H+^{-1} = Hc^{-1} + α0 sc sc^T + β0 ((Hc^{-1} yc) sc^T + sc (Hc^{-1} yc)^T),

where

α0 = (yc^T sc + yc^T Hc^{-1} yc)/(yc^T sc)^2   and   β0 = −1/(yc^T sc).
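As a quick check of this expansion, the following MATLAB fragment (a sketch, assuming that (4.5) is the usual product form of the inverse BFGS update; all variable names are illustrative and not from the text) compares the two forms for random data with y^T s > 0.

% Compare the product form of the inverse BFGS update with the
% expansion in terms of alpha0 and beta0 for random spd data.
N = 5;
R = randn(N); Hinv = R' * R + eye(N);        % a random spd H_c^{-1}
s = randn(N, 1); y = randn(N, 1);
if y' * s <= 0, y = -y; end                  % enforce the curvature condition
rho = 1 / (y' * s);
Hplus_prod = (eye(N) - rho * (s * y')) * Hinv * (eye(N) - rho * (y * s')) ...
             + rho * (s * s');
alpha0 = (y' * s + y' * Hinv * y) / (y' * s)^2;
beta0  = -1 / (y' * s);
Hplus_exp = Hinv + alpha0 * (s * s') + beta0 * ((Hinv * y) * s' + s * (Hinv * y)');
disp(norm(Hplus_prod - Hplus_exp, 'fro'))    % should be at roundoff level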
Now note that

Hc^{-1} yc = Hc^{-1} ∇f(x+) − Hc^{-1} ∇f(xc) = Hc^{-1} ∇f(x+) + sc/λc

and obtain

(4.27)    H+^{-1} = Hc^{-1} + α1 sc sc^T + β0 (sc (Hc^{-1} ∇f(x+))^T + (Hc^{-1} ∇f(x+)) sc^T),

where

α1 = α0 + 2β0/λc.
Also

d+ = −H+^{-1} ∇f(x+)

(4.28)    = −(I − sc yc^T/(yc^T sc)) Hc^{-1} (I − yc sc^T/(yc^T sc)) ∇f(x+) − sc sc^T ∇f(x+)/(yc^T sc)

          = Ac sc + Bc Hc^{-1} ∇f(x+),

where

(4.29)    Ac = [yc^T Hc^{-1} (I − yc sc^T/(yc^T sc)) ∇f(x+)]/(yc^T sc)
               + sc^T ∇f(x+)/(λc yc^T sc) − sc^T ∇f(x+)/(yc^T sc)



and
(4.30)    Bc = −1 + sc^T ∇f(x+)/(yc^T sc).
At this point we can compute d+, and therefore λ+ and s+, using only Hc^{-1} ∇f(x+). We do not
need H+^{-1} at all. We can now form H+^{-1} with the new data for the next iterate and will show that
we do not need to store the vectors {yk}.
Since (verify this!) Bc ≠ 0, we have

Hc^{-1} ∇f(x+) = s+/(λ+ Bc) − Ac sc/Bc.

Combining this with (4.27) gives
(4.31)    H+^{-1} = Hc^{-1} + αc sc sc^T + βc (sc s+^T + s+ sc^T),

where

(4.32)    αc = α1 − 2β0 Ac/Bc   and   βc = β0/(λ+ Bc).
This leads to the expansion
(4.33)    Hn+1^{-1} = H0^{-1} + Σ_{k=0}^{n} [αk sk sk^T + βk (sk sk+1^T + sk+1 sk^T)].

Upon reflection the reader will see that this is a complete algorithm. We can use (4.28) and Hn^{-1}
to compute dn+1. Then we can compute λn+1 and sn+1 and use them and (4.32) to compute
αn and βn. This new data can be used to form Hn+1^{-1} with (4.33), which we can use to compute
dn+2 and continue the iteration.
In this way only the steps {sk} and the expansion coefficients {αk} and {βk} need be stored.
Algorithm bfgsopt is an implementation of these ideas.
Algorithm 4.2.2. bfgsopt(x, f, ε)

1. g = −∇f(x), n = 0.

2. While ‖g‖ > ε

(a) If n = 0, dn = H0^{-1} g;
otherwise compute A, B, and dn using (4.28), (4.29), and (4.30).
(b) Compute λn, sn, and x = xn+1 with the Armijo rule.
(c) If n > 0 compute αn−1 and βn−1 using (4.32).
(d) g = −∇f(x), n = n + 1.
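The way Hn^{-1} acts on a vector in this scheme is through the expansion (4.33). A minimal MATLAB sketch (the name hinv_apply and the data layout are ours: the columns of S hold s0, ..., sn and alpha, beta hold the n stored coefficients) is:

function w = hinv_apply(v, S, alpha, beta, H0inv)
% Apply H_n^{-1} of (4.33) to a vector v, where n = length(alpha).
% S(:,k) is s_{k-1} in MATLAB's 1-based indexing, so S has n+1 columns.
w = H0inv(v);
for k = 1:length(alpha)
    sk  = S(:, k);
    sk1 = S(:, k + 1);
    w = w + alpha(k) * (sk' * v) * sk ...
          + beta(k)  * ((sk1' * v) * sk + (sk' * v) * sk1);
end
end

In step (a) of Algorithm bfgsopt, an application of this kind is one way to supply Hc^{-1} ∇f(x+) for (4.28) and (4.29) from the stored steps and coefficients.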



4.2.2 A BFGS–Armijo Algorithm
In this section we present a simple implementation that shows how the theoretical results can
be applied in algorithm design. Let H+^{BFGS} be the BFGS update from Hc and define the two
modified BFGS (MBFGS) updates by

(4.34)    H+ = H+^{BFGS}   if y^T s > 0,
          H+ = I           if y^T s ≤ 0,



and
(4.35)    H+ = H+^{BFGS}   if y^T s > 0,
          H+ = Hc          if y^T s ≤ 0.
In the MBFGS1 method, (4.34), the model Hessian is reinitialized to I if y^T s ≤ 0. In the
early phase of this iteration, where ∇²f may have negative eigenvalues, y^T s ≤ 0 is certainly
possible and the search direction could be the steepest descent direction for several iterations.
An MBFGS2 step (4.35) keeps the history of the iteration even if y^T s ≤ 0. One view
is that this approach keeps as much information as possible. Another is that once y^T s ≤ 0,
the iteration history is suspect and should be thrown away. Both forms are used in practice.
Our MATLAB code bfgswopt uses MBFGS1 and maintains an approximation to H^{-1} using
Algorithm bfgsopt. We also guard against poor scaling by using (3.50).
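A hedged sketch of how the MBFGS1 test might look in a code that stores the pairs {sk, yk} of §4.2.1 (the function name and arguments are ours and this is not a transcription of bfgswopt):

function [S, Y] = mbfgs1_update(S, Y, s, y)
% MBFGS1 (4.34): keep the new (s,y) pair only if the curvature condition holds;
% otherwise discard the whole history, which restarts the model Hessian at I.
if y' * s > 0
    S = [S, s];                  % append the new pair and continue the update
    Y = [Y, y];
else
    S = zeros(size(s, 1), 0);    % throw the history away (MBFGS1)
    Y = zeros(size(y, 1), 0);
end
end

For MBFGS2, (4.35), the else branch would simply leave S and Y unchanged.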


4.3 Other Quasi-Newton Methods
The DFP (Davidon, Fletcher, Powell) update [71], [72], [105]

(4.36)    H+ = Hc + [(y − Hc s) y^T + y (y − Hc s)^T]/(y^T s) − [(y − Hc s)^T s] y y^T/(y^T s)^2
has similar local convergence properties to BFGS but does not perform as well in practice [224],
[225].
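As with the BFGS formulas, the rank-two form (4.36) can be checked numerically; the MATLAB fragment below is a sketch with names of our own choosing, assuming the familiar product form of the DFP update for the model Hessian.

% Check (4.36) against the product form of the DFP update.
N = 5;
R = randn(N); Hc = R' * R + eye(N);          % a random spd model Hessian
s = randn(N, 1); y = randn(N, 1);
if y' * s <= 0, y = -y; end                  % enforce the curvature condition
rho = 1 / (y' * s);
Hdfp_prod = (eye(N) - rho * (y * s')) * Hc * (eye(N) - rho * (s * y')) ...
            + rho * (y * y');
r = y - Hc * s;                              % the quantity y - Hc*s in (4.36)
Hdfp_436  = Hc + (r * y' + y * r') / (y' * s) - ((r' * s) * (y * y')) / (y' * s)^2;
disp(norm(Hdfp_prod - Hdfp_436, 'fro'))      % should be at roundoff level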
Two updates that preserve symmetry, but not definiteness, are the PSB (Powell symmetric
Broyden) update [219],
