the Powell symmetric Broyden (PSB) update,

H_+ = H_c + \frac{(y - H_c s) s^T + s (y - H_c s)^T}{s^T s} - \frac{[s^T (y - H_c s)]\, s s^T}{(s^T s)^2},    (4.37)

and the symmetric rank-one (SR1) [35] update,

H_+ = H_c + \frac{(y - H_c s)(y - H_c s)^T}{(y - H_c s)^T s}.    (4.38)
By preserving the symmetry of the approximate Hessians, but not the positive definiteness,
these updates present a problem for a line search globalization but an opportunity for a trust
region approach. The SR1 update has been reported to outperform BFGS algorithms in certain
cases [165], [41], [64], [65], [163], [258], [118], [250], [119], [268], [164], in which either the
approximate Hessians can be expected to be positive definite or a trust region framework is used
[41], [64], [65].
One may update the inverse of the SR1 approximate Hessian using the Sherman–Morrison
formula, (4.39), a simple relation between the inverse of a nonsingular matrix and that of a
rank-one update of that matrix [93], [239], [240], [14].
Proposition 4.3.1. Let H be a nonsingular N × N matrix and let u, v ∈ R^N. Then H + uv^T
is invertible if and only if 1 + v^T H^{-1} u ≠ 0. In this case

(H + uv^T)^{-1} = \left( I - \frac{(H^{-1} u) v^T}{1 + v^T H^{-1} u} \right) H^{-1}.    (4.39)

The proof is simply a direct verification of (4.39).
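Because (4.39) is easy to test numerically, a small MATLAB experiment provides a quick sanity check. The fragment below is only an illustrative sketch and is not one of the codes that accompany this book; the dimension, random data, and seed are invented for the example.

    % Illustrative numerical check of the Sherman-Morrison formula (4.39).
    N = 5;
    rng(1);                              % reproducible example data
    H = eye(N) + 0.1*randn(N);           % a (generically) nonsingular matrix
    u = randn(N,1); v = randn(N,1);
    sigma = 1 + v'*(H\u);                % nonzero by Proposition 4.3.1
    M = (eye(N) - ((H\u)*v')/sigma)/H;   % right side of (4.39); B/H = B*inv(H)
    disp(norm(M*(H + u*v') - eye(N)))    % residual; of the order of machine precision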
The SR1 algorithm terminates in finitely many iterations for convex quadratic optimization
problems [101]. Since the denominator (y - H_c s)^T s could vanish, the update could completely
fail and implementations must examine the denominator and take appropriate action if it is too
small. This update does not enforce or require positivity of the approximate Hessian and has
been used effectively to exploit negative curvature in a trust region context [165], [41].
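To make the safeguard concrete, the following MATLAB function is a minimal sketch of one SR1 step (4.38); it is not one of the codes that accompany this book, and the function name and the relative tolerance are our own illustrative choices. The update is skipped whenever the denominator is small relative to the product of the norms of y - H_c s and s.

    function Hp = sr1_update(Hc, s, y, tol)
    % SR1_UPDATE  One safeguarded symmetric rank-one update (4.38).
    % The update is skipped when |(y - Hc*s)'*s| is small relative to
    % norm(y - Hc*s)*norm(s); the default threshold is illustrative.
    if nargin < 4
        tol = 1.e-8;
    end
    w = y - Hc*s;
    den = w'*s;
    if abs(den) <= tol*norm(w)*norm(s)
        Hp = Hc;                 % denominator too small; keep the old Hessian
    else
        Hp = Hc + (w*w')/den;    % symmetric rank-one correction
    end
    end

Since the resulting H_+ need not be positive definite, a trust region code could use it directly, while a line search code would need further safeguards.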
For overdetermined nonlinear least squares problems one can try to approximate the second-
order term in ∇²f while computing R'^T R' exactly. Suppose

∇²f(x) ≈ H = C(x) + A,

where the idea is that C, the computed part, is significantly easier to compute than A, the
approximated part. This is certainly the case for nonlinear least squares, where C = R'^T R':
for f(x) = R(x)^T R(x)/2 one has ∇²f(x) = R'(x)^T R'(x) + Σ_i r_i(x) ∇²r_i(x), and only the
second-order sum needs to be approximated. A quasi-Newton method that intends to exploit
this structure will update A only; hence

H_+ = C(x_+) + A_+.

Superlinear convergence proofs require, in one way or another, that H_+ s = y. Therefore, in
terms of A, one might require the update to satisfy

A_+ s = y^# = y - C(x_+) s.    (4.40)

The definition of y^# given in (4.40) is called the default choice in [87]. This is not the only
choice for y^#, and one can prove superlinear convergence for this and many other choices [87],
[84]. This idea, using several different updates, has been used in other contexts, such as optimal
control [159], [164].
An algorithm of this type, using SR1 to update A and a different choice for y^#, was suggested
in [20] and [21]. The nonlinear least squares update from [77], [78], and [84] uses a DFP update
and yet another y^# to compute A_+,

A_+ = A_c + \frac{(y^\# - A_c s)(y^\#)^T + y^\# (y^\# - A_c s)^T}{(y^\#)^T s} - \frac{[(y^\# - A_c s)^T y^\#]\, y^\# (y^\#)^T}{((y^\#)^T s)^2}.    (4.41)

The application of this idea to large-residual least squares problems is not trivial, and scaling
issues must be considered in a successful implementation.
Our proof of superlinear convergence can be applied to updates like (4.41). We state a special
case of a result from [87] for the BFGS formulation

A_+ = A_c + \frac{y^\# (y^\#)^T}{(y^\#)^T s} - \frac{(A_c s)(A_c s)^T}{s^T A_c s}.    (4.42)
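As an illustration, here is a MATLAB sketch of a single application of (4.42) with the default choice (4.40) of y^#. It is not one of the codes that accompany this book; the function name, argument list, and the decision simply to skip the update when the curvature quantities are not positive are our own illustrative choices.

    function Aplus = structured_bfgs_update(Ac, s, y, Cplus)
    % One structured BFGS update (4.42) using the default choice (4.40),
    % y# = y - C(x+)*s, with Cplus = C(x+).  The model Hessian at the new
    % point is then Hplus = Cplus + Aplus.
    ysharp = y - Cplus*s;                % default y# of (4.40)
    c1 = ysharp'*s;
    c2 = s'*(Ac*s);
    if c1 <= 0 || c2 <= 0
        Aplus = Ac;                      % skip the update as a safeguard
    else
        Aplus = Ac + (ysharp*ysharp')/c1 - (Ac*s)*(Ac*s)'/c2;
    end
    end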

Theorem 4.3.2. Let the standard assumptions hold and assume that

A_* = ∇²f(x_*) - C(x_*)

is spd. Then there is δ such that if

‖x_0 - x_*‖ ≤ δ and ‖A_0 - A_*‖ ≤ δ,

then the quasi-Newton iterates defined by (4.42) exist and converge q-superlinearly to x_*.
This result can be readily proved using the methods in this chapter (see [159]).
Quasi-Newton methods can also be designed to take into account special structure, such as
the sparsity pattern of the Hessian. One can update only those elements that are nonzero in the
initial approximation to the Hessian, requiring that the secant equation Hs = y holds. Such
updates have been proposed and analyzed in varying levels of generality in [83], [87], [185],
[238], [256], and [255].



Another approach is to use the dependency of f on subsets of the variables, a structure that is
often present in discretizations of infinite-dimensional problems where coefficients of operators
can be updated rather than entire matrix representations of those operators. We refer the reader
to [133], [131], and [132] for an algebraic viewpoint based on finite-dimensional analysis and
to [159], [157], [164], [160], and [163] for an operator theoretic description of these methods.
When applied to discretizations of infinite-dimensional optimization problems, quasi-Newton
methods perform best when they also work well on the infinite-dimensional problem itself. Work
on BFGS in Hilbert space can be found, for example, in [135], [158], and [159].
Quasi-Newton methods have been designed for underdetermined problems [184], and Broyden's
method itself has been applied to linear least squares problems [111], [148].


4.4 Examples
The computations in this section were done with the MATLAB code bfgswopt. For the small
parameter ID problem, where evaluation of f is far more expensive than the cost of maintaining
or factoring the (very small!) approximate Hessian, one could also use a brute force approach
in which H is updated and factored anew with each iteration.
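For readers who want to experiment with that brute force approach, the sketch below stores the full matrix H, applies the BFGS update, and factors H with chol at every iteration. It is an assumed implementation, not the code bfgswopt; the function name, Armijo parameters, and the rule of skipping the update when y^T s ≤ 0 are our own choices, and the approach is only sensible when the dimension is small and evaluations of f and ∇f dominate the cost.

    function [x, it] = bfgs_dense_sketch(f, gradf, x0, tol, maxit)
    % Brute force BFGS-Armijo: the approximate Hessian H is stored as a
    % full matrix, updated by the BFGS formula, and refactored by chol at
    % every iteration.  f and gradf are function handles for the objective
    % and its gradient.
    x = x0; g = gradf(x); H = eye(length(x0));
    alpha = 1.e-4;                            % Armijo sufficient decrease
    for it = 1:maxit
        if norm(g) <= tol
            break;
        end
        R = chol(H);                          % H = R'*R, factored anew
        d = -(R\(R'\g));                      % quasi-Newton step: H*d = -g
        lambda = 1;                           % simple Armijo backtracking
        while f(x + lambda*d) > f(x) + alpha*lambda*(g'*d) && lambda > 1.e-10
            lambda = lambda/2;
        end
        xp = x + lambda*d; gp = gradf(xp);
        s = xp - x; y = gp - g;
        if y'*s > 0                           % keep H positive definite
            H = H + (y*y')/(y'*s) - (H*s)*(H*s)'/(s'*H*s);
        end
        x = xp; g = gp;
    end
    end

For the small parameter ID problem one might call, say, [x, it] = bfgs_dense_sketch(f, gradf, x0, 1.e-6, 100), with f and gradf wrapping the objective of §3.4.1 and its gradient.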

4.4.1 Parameter ID Problem
We solve the parameter ID problem with the same data as in §3.4.1 using H_0 = I as the initial
Hessian. We compare the BFGS solution with the Gauss–Newton iteration from §3.4.1. From
Figure 4.1 one can see the local superlinear convergence and the good performance of the line
search. However, as one should expect, the Gauss–Newton iteration, being designed for small
residual least squares problems, was more efficient. The Gauss–Newton iteration required 14
function evaluations, 6 gradients, and roughly 1.3 million floating point operations, while the
BFGS–Armijo iteration needed 29 function evaluations, 15 gradients, and 3.8 million floating
point operations.

4.4.2 Discrete Control Problem
We return to the example from §3.4.2. For our first example we use the initial iterate

u_0(t) = 10.

BFGS also requires an initial approximation to the Hessian and we consider two such approximations:

H_p = .25I and H_g = I.    (4.43)
The Hessian for the continuous problem is a compact perturbation of the identity and the
theory from [158] and [135] indicates that H_g is a much better approximate Hessian than H_p.
The results in Figure 4.2 support that idea. For the better Hessian, one can see the concavity
of superlinear convergence in the plot of the gradient norm. The computation for the better
Hessian required 12 iterations and roughly 572 thousand floating point operations, while the one
with the poor Hessian took 16 iterations and roughly 880 thousand floating point operations.
Stepsize reductions were not required for the good Hessian and were needed four times during
the iteration for the poor Hessian. However, the guard against poor scaling (3.50) was needed
in both cases.
When we used the same poor initial iterate that we used in §3.4.2,

u_0(t) = 5 + 300 sin(20πt),





Figure 4.1: BFGS–Armijo and Gauss–Newton for the Parameter ID Problem (gradient norm and function value versus iterations for the Gauss–Newton and BFGS iterations).


Figure 4.2: BFGS–Armijo for Discrete Control Problem (gradient norm and function value versus iterations for the better Hessian and the poor Hessian).





and allocated 50 vectors to Algorithm bfgsopt, there was no longer a benefit to using the
good Hessian. In fact, as is clear from Figure 4.3, the poor Hessian produced a more rapidly
convergent iteration.

Figure 4.3: BFGS–Armijo for Discrete Control Problem: Poor Initial Iterate (gradient norm and function value versus iterations for the better Hessian and the poor Hessian).



4.5 Exercises on BFGS
4.5.1. Use the secant method (4.2) with initial data x_{-1} = 1 and x_0 = .9 to minimize f(x) = x^4.
Explain the convergence of the iteration.

4.5.2. Prove Lemma 4.1.1. It might help to use the secant equation.

4.5.3. Prove Lemma 4.1.4.

4.5.4. Verify (4.13) and compute the constant C.

4.5.5. Prove (4.25).

4.5.6. As an exercise in character building, implement Algorithm bfgsrec nonrecursively.

4.5.7. Show how the Sherman–Morrison formula can be used to implement the SR1 update in
such a way that only one vector need be stored for each iterate.

4.5.8. State and prove a local convergence theorem for DFP and/or PSB.

4.5.9. Implement the DFP and PSB update and compare their performance with BFGS on the
examples from §4.4.



4.5.10. Show that, for positive definite quadratic problems, the BFGS method with an exact line
search (i.e., one that finds the minimum of f in the search direction) is the same as CG
[201], [200].
4.5.11. Prove Theorem 4.3.2.








Chapter 5

Simple Bound Constraints
