
direction, thereby trying more aggressively for superlinear convergence. Implement this
method, perhaps by modifying the MATLAB code ntrust.m, and compare the results
with the examples in §3.4. Prove convergence results like Theorems 3.3.7 and 3.3.9 for
this method.
3.5.12. In [51] a trust region algorithm was proposed that permitted inaccurate gradient computa-
tions, with the relative accuracy being tightened as the iteration progresses. Look at [51]
and try to design a similar algorithm based on the line search paradigm. What problems
do you encounter? How do you solve them?
3.5.13. Suppose one modifies Algorithm trtest by not resolving the trust region problem if
the trial point is rejected, but instead performing a line search from $x_t$, and setting $\Delta =
\|x_+ - x_c\|$, where $x_+$ is the accepted point from the line search. Discuss the merits of
this modification and any potential problems. See [209] for the development of this idea.
3.5.14. Write programs for optimization that take full Gauss–Newton or Newton steps (you can
cripple the MATLAB codes gaussn.m and ntrust.m for this). Apply these codes to
the parameter identification problem from §3.4.1. What happens?
3.5.15. Write a nonlinear CG code and apply it to the problems in §3.4. Try at least two ways to
manage the line search. How important are the (strong) Wolfe conditions?
3.5.16. Discuss the impact of using a difference Hessian in Algorithm trcg. How will the global
convergence of Algorithm cgtrust be affected? How about the local convergence?
Consider the accuracy in the evaluation of ∇f in your results.
3.5.17. Without looking at [247] describe in general terms how the proof of Theorem 3.3.1 should
be modified to prove Theorem 3.3.10. Then examine the proof in [247] to see if you left
anything out.








Chapter 4

The BFGS Method

Quasi-Newton methods update an approximation of $\nabla^2 f(x^*)$ as the iteration progresses. In
general the transition from current approximations $x_c$ and $H_c$ of $x^*$ and $\nabla^2 f(x^*)$ to new
approximations $x_+$ and $H_+$ is given (using a line search paradigm) by the following steps:
1. Compute a search direction $d = -H_c^{-1} \nabla f(x_c)$.
2. Find $x_+ = x_c + \lambda d$ using a line search to insure sufficient decrease.
3. Use $x_c$, $x_+$, and $H_c$ to update $H_c$ and obtain $H_+$.
The way in which $H_+$ is computed determines the method.
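As a concrete sketch of these three steps, the following minimal MATLAB loop may help. It is illustrative only and not one of the codes that accompany this book; the names qnewton_sketch, fgrad (a handle returning $f$ and $\nabla f$), and update_H (a placeholder for whichever update defines the method, e.g., (4.4) below), as well as the crude Armijo backtracking, are assumptions made for this sketch.

% Generic quasi-Newton line search loop (illustrative sketch only).
% fgrad:    handle returning [f, g] = fgrad(x)
% update_H: handle implementing the chosen update of the model Hessian
function x = qnewton_sketch(fgrad, x, H, update_H, maxit, tau)
for n = 1:maxit
    [fc, g] = fgrad(x);
    if norm(g) <= tau, break; end
    d = -H \ g;                           % step 1: search direction
    lambda = 1;                           % step 2: crude Armijo backtracking
    while fgrad(x + lambda*d) > fc + 1e-4*lambda*(g'*d)
        lambda = lambda/2;
    end
    xp = x + lambda*d;
    [~, gp] = fgrad(xp);
    H = update_H(H, xp - x, gp - g);      % step 3: update the model Hessian
    x = xp;
end
end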
The BFGS (Broyden, Fletcher, Goldfarb, Shanno) [36], [103], [124], [237] method, which
is the focus of this chapter, and the other methods we will mention in §4.3 are also called secant
methods because they satisfy the secant equation
\[
H_+ s = y. \tag{4.1}
\]
In (4.1)
\[
s = x_+ - x_c \quad \text{and} \quad y = \nabla f(x_+) - \nabla f(x_c).
\]
If $N = 1$, all secant methods reduce to the classical secant method for the single nonlinear
equation $f(x) = 0$, i.e.,
\[
x_+ = x_c - \frac{f(x_c)(x_c - x_-)}{f(x_c) - f(x_-)}, \tag{4.2}
\]
where $x_-$ is the iterate previous to $x_c$.
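As an aside, (4.2) is easy to transcribe. The MATLAB sketch below is only an illustration of the scalar secant iteration; the function name secant_scalar, the argument names, and the simple residual-based stopping test are choices made for this example, not taken from the book's codes.

function x = secant_scalar(f, xm, xc, tol, maxit)
% Classical secant iteration (4.2) for a scalar nonlinear equation f(x) = 0.
% xm and xc are the two starting guesses (the previous and current iterates).
for n = 1:maxit
    xp = xc - f(xc)*(xc - xm)/(f(xc) - f(xm));   % (4.2)
    xm = xc; xc = xp;
    if abs(f(xc)) <= tol, break; end
end
x = xc;
end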
The standard quasi-Newton update for nonlinear equations is Broyden's [34] method, a
rank-one update,
\[
H_+ = H_c + \frac{(y - H_c s) s^T}{s^T s}. \tag{4.3}
\]
Broyden's method does not preserve the structural properties needed for line search methods in
optimization, namely, symmetry and positive definiteness, and could, in fact, encourage convergence
to a local maximum. For that reason quasi-Newton methods in optimization are more
complex than those used for nonlinear equations. The methods of analysis and implementation
are more complex as well.
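For contrast with the symmetric updates that follow, a minimal MATLAB transcription of the rank-one update (4.3) might read as below; it is a sketch only (the function name is hypothetical), not one of the book's codes.

function Hp = broyden_update(Hc, s, y)
% Broyden's rank-one update (4.3): the new model satisfies Hp*s = y,
% but Hp is in general neither symmetric nor positive definite.
Hp = Hc + ((y - Hc*s)*s')/(s'*s);
end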
In this chapter we will concentrate on the BFGS method [36], [103], [124], [237], which is
the rank-two update
\[
H_+ = H_c + \frac{y y^T}{y^T s} - \frac{(H_c s)(H_c s)^T}{s^T H_c s}. \tag{4.4}
\]
We will briefly discuss other updates and variations that exploit problem structure in §4.3.
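A direct MATLAB transcription of (4.4) might look like the following sketch. It is illustrative only (the function name bfgs_update is hypothetical); practical implementations typically work with the inverse (see (4.5) below) or a factored form rather than forming $H_+$ explicitly, and the guard on $y^T s$ anticipates the discussion of update failure in §4.1.

function Hp = bfgs_update(Hc, s, y)
% Rank-two BFGS update (4.4). If the curvature condition y'*s > 0 fails,
% the update is declared a failure and the old model is kept (see §4.1).
if y'*s <= 0
    Hp = Hc;
    return
end
Hs = Hc*s;
Hp = Hc + (y*y')/(y'*s) - (Hs*Hs')/(s'*Hs);
end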






4.1 Analysis
This section begins with some simple observations on nonsingularity and positivity of the update.
It is very useful for both theory and practice to express (4.4) in terms of the inverse matrices.
The formula we use in this book is given in Lemma 4.1.1.
Lemma 4.1.1. Let $H_c$ be spd, $y^T s \ne 0$, and $H_+$ given by (4.4). Then $H_+$ is nonsingular
and
\[
H_+^{-1} = \left(I - \frac{s y^T}{y^T s}\right) H_c^{-1} \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}. \tag{4.5}
\]

Proof. See exercise 4.5.2.
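Formula (4.5) is easy to check numerically. The following short MATLAB snippet is only a sanity check and not part of the book's software: it builds a random spd $H_c$, forms $H_+$ from (4.4) and the candidate inverse from (4.5), and verifies that their product is close to the identity.

% Numerical check of Lemma 4.1.1 (illustrative only).
n = 5;
A = randn(n); Hc = A*A' + eye(n);          % a random spd matrix
s = randn(n,1); y = randn(n,1);
if y'*s < 0, y = -y; end                   % keep y'*s > 0 for this check
Hs = Hc*s;
Hp = Hc + (y*y')/(y'*s) - (Hs*Hs')/(s'*Hs);      % the update (4.4)
W  = eye(n) - (s*y')/(y'*s);
Hp_inv = W*(Hc\eye(n))*W' + (s*s')/(y'*s);       % the inverse formula (4.5)
disp(norm(Hp*Hp_inv - eye(n)))             % should be near machine precision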
Lemma 4.1.2. Let $H_c$ be spd, $y^T s > 0$, and $H_+$ given by (4.4). Then $H_+$ is spd.

Proof. Positivity of $H_c$ and $y^T s \ne 0$ imply that for all $z \ne 0$,
\[
z^T H_+ z = \frac{(z^T y)^2}{y^T s} + z^T H_c z - \frac{(z^T H_c s)^2}{s^T H_c s}.
\]
Using the symmetry and positivity of $H_c$, we have
\[
(z^T H_c s)^2 \le (s^T H_c s)(z^T H_c z),
\]
with equality only if $z$ and $s$ are linearly dependent. Hence, if $z$ is not a multiple of $s$, then,
since $y^T s > 0$,
\[
z^T H_+ z > \frac{(z^T y)^2}{y^T s} \ge 0,
\]
while if $z = cs$ with $c \ne 0$, then $z^T H_+ z = c^2 y^T s > 0$. In either case $z^T H_+ z > 0$, as asserted.
If $y^T s \le 0$ the update is considered a failure.


4.1.1 Local Theory
The local theory [37] requires accurate initial approximations to both $x^*$ and $\nabla^2 f(x^*)$. The
statement of the convergence result is easy to understand.
Theorem 4.1.3. Let the standard assumptions hold. Then there is $\delta$ such that if
\[
\|x_0 - x^*\| \le \delta \quad \text{and} \quad \|H_0 - \nabla^2 f(x^*)\| \le \delta,
\]
then the BFGS iterates are defined and converge q-superlinearly to $x^*$.


Technical Details
The proof of Theorem 4.1.3 is technical and we subdivide it into several lemmas. Our proof is
a hybrid of ideas from [37], [135], and [154]. Similar to other treatments of this topic [45] we
begin with the observation (see §2.5.2) that one may assume $\nabla^2 f(x^*) = I$ for the convergence
analysis.
Lemma 4.1.4. Let the standard assumptions hold and let
\[
\hat f(y) = f(Ay),
\]
where $A = (\nabla^2 f(x^*))^{-1/2}$. Let $x_c$ and $H_c$ be given and let $\hat x_c = A^{-1} x_c$ and $\hat H_c = A H_c A$.
Then the BFGS updates $(x_+, H_+)$ for $f$ and $(\hat x_+, \hat H_+)$ for $\hat f$ are related by
\[
\hat x_+ = A^{-1} x_+ \quad \text{and} \quad \hat H_+ = A H_+ A.
\]
In particular, the BFGS sequence for $f$ exists (i.e., $H_n$ is spd for all $n$) if and only if the BFGS
sequence for $\hat f$ does, and the convergence of $\{x_n\}$ is q-superlinear if and only if the convergence
of $\{\hat x_n\}$ is.

Proof. The proof is a simple calculation and is left for exercise 4.5.3.
Hence we can, with no loss of generality, assume that $\nabla^2 f(x^*) = I$, for if this is not so, we
can replace $f$ by $\hat f$ and obtain an equivalent problem for which it is.
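The calculation behind Lemma 4.1.4, which exercise 4.5.3 asks for, comes down to the chain rule; the identities below are offered only as a sketch of it, written with a generic argument in the notation of the lemma:
\[
\nabla \hat f(\hat x) = A\,\nabla f(A \hat x), \qquad
\nabla^2 \hat f(\hat x) = A\,\nabla^2 f(A \hat x)\,A,
\]
so that at $\hat x^* = A^{-1} x^*$ one has $\nabla^2 \hat f(\hat x^*) = A\,\nabla^2 f(x^*)\,A = I$. Moreover $\hat s = A^{-1} s$, $\hat y = A y$, and $\hat y^T \hat s = y^T s$, so applying (4.4) in the transformed variables gives $\hat H_+ = A H_+ A$.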
Keeping in mind our assumption that $\nabla^2 f(x^*) = I$, we denote errors in the inverse Hessian
by
\[
E = H^{-1} - \nabla^2 f(x^*)^{-1} = H^{-1} - I.
\]
These errors satisfy a simple recursion [37].
Lemma 4.1.5. Let the standard assumptions hold. Let $H_c$ be spd and
\[
x_+ = x_c - H_c^{-1} \nabla f(x_c).
\]
Then there is $\delta_0$ such that if
\[
0 < \|x_c - x^*\| \le \delta_0 \quad \text{and} \quad \|E_c\| \le \delta_0,
\]
then $y^T s > 0$. Moreover, if $H_+$ is the BFGS update of $H_c$ then
\[
E_+ = (I - ww^T) E_c (I - ww^T) + \Delta, \tag{4.6}
\]
where $w = s/\|s\|$ and for some $K_\Delta > 0$
\[
\|\Delta\| \le K_\Delta \|s\|. \tag{4.7}
\]

Proof. Let $\delta_0$ be small enough so that $\nabla f(x_c) \ne 0$ if $x_c \ne x^*$. Theorem 1.2.1 implies that
\[
\nabla f(x_c) = \int_0^1 \nabla^2 f(x^* + t e_c) e_c \, dt = e_c + \Delta_1 e_c,
\]
where $\Delta_1$ is the matrix given by
\[
\Delta_1 = \int_0^1 (\nabla^2 f(x^* + t e_c) - I) \, dt.
\]
Clearly
\[
\|\Delta_1\| \le \gamma \|e_c\|/2,
\]
and
\[
s = -H_c^{-1} \nabla f(x_c) = -(I + E_c)(I + \Delta_1) e_c.
\]
Therefore,
\[
\|e_c\|(1 - \delta_0)(1 - \gamma \delta_0/2) \le \|s\| \le \|e_c\|(1 + \delta_0)(1 + \gamma \delta_0/2)
\]
and hence
\[
0 < \|e_c\|/2 \le \|s\| \le 2 \|e_c\| \tag{4.8}
\]
if, say,
\[
\delta_0 \le \min(1/4, 1/(2\gamma)). \tag{4.9}
\]



We will assume that (4.9) holds for the rest of this section.
The standard assumptions, our assumption that $\nabla^2 f(x^*) = I$, and the fundamental theorem
of calculus imply that
\[
y = \nabla f(x_+) - \nabla f(x_c) = \int_0^1 \nabla^2 f(x_c + ts) s \, dt
  = s + \int_0^1 (\nabla^2 f(x_c + ts) - I) s \, dt = s + \Delta_2 s, \tag{4.10}
\]

where $\Delta_2$ is the matrix given by
\[
\Delta_2 = \int_0^1 (\nabla^2 f(x_c + ts) - I) \, dt.
\]

The standard assumptions imply that $\|\Delta_2\| \le \gamma (\|e_+\| + \|e_c\|)/2$. Hence, (4.8) implies that
\[
y^T s = s^T s + (\Delta_2 s)^T s \ge \|s\|^2 (1 - 3\gamma \|e_c\|/2) \ge \|s\|^2 (1 - 3\gamma \delta_0/2) > 0 \tag{4.11}
\]
provided $\delta_0 < 2/(3\gamma)$. We have that

\[
\frac{s y^T}{y^T s} = \frac{s s^T + s (\Delta_2 s)^T}{s^T s + (\Delta_2 s)^T s}
= \frac{s s^T}{s^T s} - \Delta_3 = w w^T - \Delta_3, \tag{4.12}
\]
where (see exercise 4.5.4), for some $C > 0$,
\[
\|\Delta_3\| \le C \|s\|. \tag{4.13}
\]

Subtracting $(\nabla^2 f(x^*))^{-1} = I$ from (4.5) and using (4.12) gives us
\[
\begin{aligned}
E_+ &= (I - ww^T + \Delta_3) H_c^{-1} (I - ww^T + \Delta_3^T) + ww^T - I \\
    &= (I - ww^T)(E_c + I)(I - ww^T) + ww^T - I + \Delta \\
    &= (I - ww^T) E_c (I - ww^T) + \Delta,
\end{aligned}
\]
where
\[
\Delta = \Delta_3 H_c^{-1} (I - ww^T + \Delta_3^T) + (I - ww^T) H_c^{-1} \Delta_3^T.
\]
Therefore, if (4.9) holds then $1 + \delta_0 \le 3/2$ and
\[
\|\Delta\| \le (1 + \delta_0) \|\Delta_3\| (2 + \|\Delta_3\|) \le \|s\| \, 3C(2 + C\|s\|)/2
\le 3C \|s\| (2 + 2C\delta_0)/2.
\]
Reduce $\delta_0$ if necessary so that $2C\delta_0 \le 1$ and the proof is complete with $K_\Delta = 9C/2$.
Lemma 4.1.5 implies that the approximate Hessians do not drift too far from the exact Hessian
if the initial data are good. This property, called bounded deterioration in [37], will directly
imply local q-linear convergence.
Corollary 4.1.6. Let the assumptions of Lemma 4.1.5 hold and let δ0 be as in the statement
of Lemma 4.1.5. Then

\[
\|E_+\| \le \|E_c\| + K_\Delta \|s\| \le \|E_c\| + K_\Delta (\|e_c\| + \|e_+\|). \tag{4.14}
\]





Proof. The leftmost inequality follows from Lemma 4.1.5 and the fact that $I - ww^T$ is an
orthogonal projection. The final inequality follows from the triangle inequality.
We are now ready to prove local q-linear convergence. This is of interest in its own right and
is a critical step in the superlinear convergence proof. Note that, unlike the statement and proof
of Theorem 2.3.4, we do not express the estimates in terms of $\|H - \nabla^2 f(x^*)\| = \|H - I\|$ but
in terms of $\|E\| = \|H^{-1} - I\|$. The two approaches are equivalent, since if $\|E\| \le \delta < 1/2$, then
$\|H^{-1}\| < 3/2$ and the Banach lemma implies that $\|H\| \le 2$. Hence
\[
\|H_n - I\|/2 \le \|H_n\|^{-1} \|H_n - I\| \le \|H_n^{-1} - I\|
= \|H_n^{-1}(H_n - I)\| \le \|H_n^{-1}\| \, \|H_n - I\| \le 3 \|H_n - I\|/2.
\]

Theorem 4.1.7. Let the standard assumptions hold and let $\sigma \in (0, 1)$. Then there is $\delta$ such
that if
\[
\|x_0 - x^*\| \le \delta \quad \text{and} \quad \|H_0^{-1} - \nabla^2 f(x^*)^{-1}\| \le \delta, \tag{4.15}
\]
then the BFGS iterates are defined and converge q-linearly to $x^*$ with q-factor at most $\sigma$.
Proof. For $\hat\delta$ sufficiently small and
\[
\|x_c - x^*\| \le \hat\delta \quad \text{and} \quad \|E_c\| = \|H_c^{-1} - I\| \le \hat\delta, \tag{4.16}
\]
the standard assumptions imply that there is $\bar K$ such that
\[
\|e_+\| \le \bar K (\|E_c\| \, \|e_c\| + \|e_c\|^2)/2 \le \bar K \hat\delta \|e_c\|. \tag{4.17}
\]
Reduce $\hat\delta$ if necessary so that $\bar K \hat\delta \le \sigma$ to obtain $\|e_+\| \le \sigma \|e_c\|$. The method of proof is to select
$\delta$ so that (4.16) is maintained for the entire iteration if the initial iterates satisfy (4.15).
With this in mind we set
\[
\delta = \frac{\delta^*}{2} \left( 1 + \frac{K_\Delta (1 + \sigma)}{1 - \sigma} \right)^{-1} < \delta^*/2, \tag{4.18}
\]
where $K_\Delta$ is from Lemma 4.1.5. Now if $\|H_0 - I\| < \delta$ then
\[
\|E_0\| \le \delta/(1 - \delta) \le 2\delta < \delta^*,
\]
which is the estimate we need.
Now by Corollary 4.1.6
\[
\|E_1\| \le \|E_0\| + K_\Delta (1 + \sigma) \|e_0\|.
\]
The proof will be complete if we can show that (4.15) and (4.18) imply that $\|E_n\| < \delta^*$ for all
$n$. We do this inductively. If $\|E_n\| < \delta^*$ and $\|e_{j+1}\| \le \sigma \|e_j\|$ for all $j \le n$, then (4.14) implies
that
\[
\begin{aligned}
\|E_{n+1}\| &\le \|E_n\| + K_\Delta (\|e_n\| + \|e_{n+1}\|) \le \|E_n\| + K_\Delta (1 + \sigma) \|e_n\| \\
&\le \|E_n\| + K_\Delta (1 + \sigma) \sigma^n \|e_0\| \le \|E_n\| + K_\Delta (1 + \sigma) \sigma^n \delta \\
&\le \|E_0\| + \delta K_\Delta (1 + \sigma) \sum_{j=0}^{n} \sigma^j \\
&\le 2\delta \left( 1 + \frac{K_\Delta (1 + \sigma)}{1 - \sigma} \right).
\end{aligned}
\]
We complete the induction and the proof by invoking (4.18) to conclude that $\|E_{n+1}\| \le \delta^*$.



Proof of Theorem 4.1.3
The Dennis–Moré condition [82], [81] is a necessary and sufficient condition for superlinear
convergence of quasi-Newton methods. In terms of the assumptions we make in this section,
the condition is
\[
\lim_{n \to \infty} \frac{\|E_n s_n\|}{\|s_n\|} = 0, \tag{4.19}
\]

where $\{s_n\}$ is the sequence of steps and $\{E_n\}$ is the sequence of errors in the inverse Hessian.
We will only state and prove the special case of the necessary condition that we need and refer
the reader to [82], [81], [84], or [154] for more general proofs.
Theorem 4.1.8. Let the standard assumptions hold; let $\{H_n\}$ be a sequence of nonsingular
$N \times N$ matrices satisfying
\[
\|H_n\| \le M \tag{4.20}
\]
for some $M > 0$. Let $x_0 \in \mathbb{R}^N$ be given and let $\{x_n\}_{n=0}^{\infty}$ be given by
