3.2.3 Damped Gauss–Newton Iteration
As we showed in §2.4, the steepest descent direction for the overdetermined least squares objective

f(x) = \frac{1}{2} \sum_{i=1}^{M} r_i(x)^2 = \frac{1}{2} R(x)^T R(x)

is

-\nabla f(x) = -R'(x)^T R(x).
The steepest descent algorithm could be applied to nonlinear least squares problems with the
good global performance and poor local convergence that we expect.
The Gauss–Newton direction at x,

d_{GN} = -(R'(x)^T R'(x))^{-1} R'(x)^T R(x),

is not defined if R'(x) fails to have full column rank. If R'(x) does have full column rank, then

(d_{GN})^T \nabla f(x) = -(R'(x)^T R(x))^T (R'(x)^T R'(x))^{-1} R'(x)^T R(x) < 0,

and the Gauss–Newton direction is a descent direction. The combination of the Armijo rule with
the Gauss–Newton direction is called damped Gauss–Newton iteration.
A problem with the damped Gauss–Newton algorithm is that, in order for Theorem 3.2.4 to
be applicable, the matrices {R'(x_k)^T R'(x_k)} must not only have full column rank but also must
be uniformly bounded and well conditioned, which are very strong assumptions (but if they are
satisfied, damped Gauss–Newton is a very effective algorithm).
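As an illustration, one damped Gauss–Newton step can be sketched as follows. This is a minimal sketch, not an implementation from the text; the names R, J, and damped_gauss_newton_step are hypothetical, and the sketch assumes R'(x) has full column rank.

```python
import numpy as np

def damped_gauss_newton_step(R, J, x, alpha=1e-4, max_backtracks=25):
    # One step of damped Gauss-Newton: the Gauss-Newton direction plus
    # Armijo backtracking.  R(x) returns the residual vector, J(x) the
    # Jacobian R'(x); both are hypothetical helper callables.
    r = R(x)
    Jx = J(x)
    f = 0.5 * (r @ r)
    g = Jx.T @ r                        # gradient of f(x) = R(x)^T R(x)/2
    # Gauss-Newton direction via a least squares solve (never forms J^T J)
    d, *_ = np.linalg.lstsq(Jx, -r, rcond=None)
    lam = 1.0
    for _ in range(max_backtracks):     # Armijo rule on f along d
        xt = x + lam * d
        rt = R(xt)
        if 0.5 * (rt @ rt) <= f + alpha * lam * (g @ d):
            return xt
        lam /= 2.0
    return x                            # line search failed; keep x

# Zero residual example: r_i(x) = x_i^2 - 1, so the minimizer is x = (1, 1)
R = lambda x: x**2 - 1.0
J = lambda x: np.diag(2.0 * x)
x = np.array([1.5, 0.5])
for _ in range(8):
    x = damped_gauss_newton_step(R, J, x)
```

Near this zero residual solution the full step is accepted and the iteration converges quadratically, as the local theory predicts.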
The Levenberg–Marquardt method [172], [183] addresses these issues by adding a regularization
parameter ν > 0 to R'(x_c)^T R'(x_c) to obtain x_+ = x_c + s, where

s = -(\nu_c I + R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c),
(3.20)

where I is the N × N identity matrix. The matrix \nu_c I + R'(x_c)^T R'(x_c) is positive definite.
The parameter ν is called the Levenberg–Marquardt parameter.
It is not necessary to compute R'(x_c)^T R'(x_c) to compute a Levenberg–Marquardt step [76].
One can also solve the full-rank (M + N) × N linear least squares problem

\min_s \frac{1}{2} \left\| \begin{pmatrix} R'(x_c) \\ \sqrt{\nu_c}\, I \end{pmatrix} s + \begin{pmatrix} R(x_c) \\ 0 \end{pmatrix} \right\|^2
(3.21)

to compute s (see exercise 3.5.6). Compare this with computing the undamped Gauss–Newton
step by solving (2.33).
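The equivalence of (3.20) and (3.21) is easy to check numerically. The sketch below, with made-up random data standing in for R'(x_c) and R(x_c), computes the step both ways; they agree because the normal equations of (3.21) are exactly (3.20).

```python
import numpy as np

# Levenberg-Marquardt step two ways (a sketch; Jc, Rc, nu are made-up data):
# (1) the normal equations form (3.20); (2) the augmented problem (3.21).
rng = np.random.default_rng(0)
M, N = 7, 3
Jc = rng.standard_normal((M, N))     # stands in for R'(x_c)
Rc = rng.standard_normal(M)          # stands in for R(x_c)
nu = 0.1

# (3.20): s = -(nu I + Jc^T Jc)^{-1} Jc^T Rc
s_normal = np.linalg.solve(nu * np.eye(N) + Jc.T @ Jc, -Jc.T @ Rc)

# (3.21): min_s || [Jc; sqrt(nu) I] s + [Rc; 0] ||, without forming Jc^T Jc
A = np.vstack([Jc, np.sqrt(nu) * np.eye(N)])
b = -np.concatenate([Rc, np.zeros(N)])
s_augmented, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(s_normal, s_augmented)
```

The augmented form is generally preferred in practice because it avoids squaring the condition number of R'(x_c).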
If one couples the Levenberg–Marquardt method with the Armijo rule, then Theorem 3.2.4
is applicable far from a minimizer and Theorem 2.4.1 nearby. We ask the reader to provide the
details of the proof in exercise 3.5.7.


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




48 ITERATIVE METHODS FOR OPTIMIZATION

Theorem 3.2.7. Let R' be Lipschitz continuous. Let {x_k} be the Levenberg–Marquardt–Armijo
iterates. Assume that \|R'(x_k)\| is uniformly bounded and that the sequence of Levenberg–
Marquardt parameters {\nu_k} is such that

\kappa(\nu_k I + R'(x_k)^T R'(x_k))

is bounded. Then

\lim_{k \to \infty} R'(x_k)^T R(x_k) = 0.

Moreover, if x^* is any limit point of {x_k} at which R(x^*) = 0, Assumption 2.4.1 holds, and
\nu_k \to 0, then x_k \to x^* q-superlinearly. If, moreover,

\nu_k = O(\|R(x_k)\|)

as k \to \infty, then the convergence is q-quadratic.
For example, if \kappa(R'(x_k)^T R'(x_k)) and \|R'(x_k)\| are bounded, then \nu_k = \min(1, \|R(x_k)\|)
would satisfy the assumptions of Theorem 3.2.7. For a zero residual problem, this addresses the
potential conditioning problems of the damped Gauss–Newton method and still gives quadratic
convergence in the terminal phase of the iteration. The Levenberg–Marquardt–Armijo iteration
will also converge, albeit slowly, for a large residual problem.
We will not discuss globally convergent methods for underdetermined least squares problems
in this book. We refer the reader to [24], [252], and [253] for discussion of underdetermined
problems.

3.2.4 Nonlinear Conjugate Gradient Methods
Operationally, the conjugate gradient iteration for a quadratic problem updates the current iter-
ation with a linear combination of the current residual r and a search direction p. The search
direction is itself a linear combination of previous residuals. Only r and p need be stored to
continue the iteration. The methods discussed in this section seek to extend this idea to more
general nonlinear problems.
Nonlinear conjugate gradient algorithms have the significant advantage of low storage over
most of the other algorithms covered in this book, the method of steepest descent being the
exception. For problems so large that the Newton or quasi-Newton methods cannot be
implemented using the available storage, these methods are among the few options (see [177]
and [5] for examples).
Linear conjugate gradient seeks to minimize f(x) = x^T H x/2 - x^T b. The residual r =
b - Hx is simply -\nabla f(x), leading to a natural extension to nonlinear problems in which
r_0 = p_0 = -\nabla f(x_0) and, for k \ge 1,

r_k = -\nabla f(x_k) \quad \text{and} \quad p_k = r_k + \beta_k p_{k-1}.
(3.22)

The update of x,

x_{k+1} = x_k + \alpha_k p_k,

can be done with a simple analytic minimization in the quadratic case, but a line search will be
needed in the nonlinear case. The missing pieces, therefore, are the choice of \beta_k, the way the
line search will be done, and convergence theory. Theory is needed, for example, to decide if
p_k is a descent direction for all k.
The general form of the algorithm is very simple. The inputs are an initial iterate, which
will be overwritten by the solution, the function to be minimized, and a termination vector
\tau = (\tau_r, \tau_a) of relative and absolute residuals.

Algorithm 3.2.2. nlcg(x, f, \tau)






GLOBAL CONVERGENCE 49

1. r_0 = \|\nabla f(x)\|; k = 0

2. Do while \|\nabla f(x)\| > \tau_r r_0 + \tau_a

(a) If k = 0 then p = -\nabla f(x) else
p = -\nabla f(x) + \beta p

(b) x = x + \alpha p

The two most common choices for \beta, both of which are equivalent to the linear CG formula
in the quadratic case, are the Fletcher–Reeves [106]

\beta_k^{FR} = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}
(3.23)

and Polak–Ribière [215], [216]

\beta_k^{PR} = \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k-1}))}{\|\nabla f(x_{k-1})\|^2}
(3.24)

formulas. The Fletcher–Reeves method has been observed to take long sequences of very small
steps and virtually stagnate [112], [207], [208], [226]. The Polak–Ribière formula performs
much better and is more commonly used but has a less satisfactory convergence theory.
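To make Algorithm nlcg concrete, here is a sketch using the Polak–Ribière formula (3.24), safeguarded by replacing \beta with \max(\beta, 0) and by a steepest descent restart whenever the direction fails to be a descent direction. For brevity it uses plain Armijo backtracking, not the Wolfe line search the convergence theory requires; all names are hypothetical.

```python
import numpy as np

def nlcg(x, f, grad, tau_r=1e-6, tau_a=1e-12, max_iter=500):
    # Nonlinear CG sketch: Polak-Ribiere beta (3.24) with the max(beta, 0)
    # reset, a simple Armijo backtracking line search, and a restart
    # safeguard so that p is always a descent direction.
    g = grad(x)
    r0 = np.linalg.norm(g)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        lam, fx = 1.0, f(x)                 # Armijo backtracking along p
        while f(x + lam * p) > fx + 1e-4 * lam * (g @ p):
            lam /= 2.0
        x = x + lam * p
        g_new = grad(x)
        beta = max(g_new @ (g_new - g) / (g @ g), 0.0)   # PR+ formula
        p = -g_new + beta * p
        if g_new @ p >= 0:                  # restart with steepest descent
            p = -g_new
        g = g_new
    return x

# quadratic test: f(x) = x^T H x / 2 - b^T x, minimizer solves H x = b
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ H @ x - b @ x
grad = lambda x: H @ x - b
xstar = nlcg(np.zeros(2), f, grad)
```

The restart safeguard matters: with an inexact line search the raw Polak–Ribière direction can fail to be a descent direction, which is precisely why the convergence theory places extra demands on the line search.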
The line search has more stringent requirements, at least for proofs of convergence, than
are satisfied by the Armijo method that we advocate for steepest descent. We require that the
steplength parameter satisfy the Wolfe conditions [272], [273]

f(x_k + \alpha_k p_k) \le f(x_k) + \sigma_\alpha \alpha_k \nabla f(x_k)^T p_k
(3.25)

and

\nabla f(x_k + \alpha_k p_k)^T p_k \ge \sigma_\beta \nabla f(x_k)^T p_k,
(3.26)

where 0 < \sigma_\alpha < \sigma_\beta < 1. The first of the Wolfe conditions (3.25) is the sufficient decrease
condition (3.4) that all line search algorithms must satisfy. The second (3.26) is often, but not
always, implied by the Armijo backtracking scheme of alternating a test for sufficient decrease
and reduction of the steplength. One can design a line search method that will, under modest
assumptions, find a steplength satisfying the Wolfe conditions [104], [171], [193].
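The two conditions are cheap to check. The helper below is an illustrative sketch (the name wolfe_ok and the sigma values are conventional choices, not taken from the text); it shows why the curvature condition (3.26) is needed: a very small steplength passes the sufficient decrease test (3.25) but is rejected by (3.26).

```python
import numpy as np

def wolfe_ok(f, grad, x, p, lam, sigma_alpha=1e-4, sigma_beta=0.9):
    # Check the Wolfe conditions (3.25)-(3.26) for a trial steplength lam.
    gTp = grad(x) @ p
    sufficient_decrease = f(x + lam * p) <= f(x) + sigma_alpha * lam * gTp
    curvature = grad(x + lam * p) @ p >= sigma_beta * gTp
    return sufficient_decrease and curvature

# On f(x) = ||x||^2 / 2 from x = (1, 1) along the steepest descent direction:
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, 1.0])
p = -grad(x)
# lam = 1 is the exact minimizer along p: both conditions hold
ok_full = wolfe_ok(f, grad, x, p, 1.0)
# a tiny lam gives sufficient decrease but fails the curvature condition
ok_tiny = wolfe_ok(f, grad, x, p, 1e-6)
```

Here ok_full is true while ok_tiny is false: the curvature condition rules out the vanishingly small steps that plain backtracking can, in principle, accept.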
The convergence result [3] for the Fletcher–Reeves formula requires a bit more. The proof
that p_k is a descent direction requires the strong Wolfe conditions, which replace (3.26) by

|\nabla f(x_k + \alpha_k p_k)^T p_k| \le -\sigma_\beta \nabla f(x_k)^T p_k
(3.27)

and demand that 0 < \sigma_\alpha < \sigma_\beta < 1/2. The algorithm from [193], for example, will find a
point satisfying the strong Wolfe conditions.
Theorem 3.2.8. Assume that the set

N = \{x \mid f(x) \le f(x_0)\}

is bounded and that f is Lipschitz continuously differentiable in a neighborhood of N. Let
Algorithm nlcg be implemented with the Fletcher–Reeves formula and a line search that satisfies
the strong Wolfe conditions. Then

\lim_{k \to \infty} \nabla f(x_k) = 0.









This result has been generalized to allow for any choice of \beta_k such that |\beta_k| \le \beta_k^{FR} [112].
A similar result for the Polak–Ribière method, but with more complex conditions on the line
search, has been proved in [134]. This complexity in the line search is probably necessary, as
there are examples where reasonable line searches lead to failure in the Polak–Ribière method
[222]. One can also prove convergence if \beta_k^{PR} is replaced by \max(\beta_k^{PR}, 0) [112].
There is continuing research on these methods and we point to [112], [134], [205], and [202]
as good sources.


3.3 Trust Region Methods
Trust region methods overcome the problems that line search methods encounter with non-spd
approximate Hessians. In particular, a Newton trust region strategy allows the use of complete
Hessian information even in regions where the Hessian has negative curvature. The speci¬c trust
region methods we will present effect a smooth transition from the steepest descent direction to
the Newton direction in a way that gives the global convergence properties of steepest descent
and the fast local convergence of Newton™s method.
The idea is very simple. We let \Delta be the radius of the ball about x_c in which the quadratic
model

m_c(x) = f(x_c) + \nabla f(x_c)^T (x - x_c) + (x - x_c)^T H_c (x - x_c)/2

can be trusted to accurately represent the function. \Delta is called the trust region radius and the
ball

T(\Delta) = \{x \mid \|x - x_c\| \le \Delta\}

is called the trust region.
We compute the new point x_+ by (approximately) minimizing m_c over T(\Delta). The trust
region problem for doing that is usually posed in terms of the difference s_t between x_c and the
minimizer of m_c in the trust region

\min_{\|s\| \le \Delta} m_c(x_c + s).
(3.28)

We will refer to either the trial step s_t or the trial solution x_t = x_c + s_t as the solution to the
trust region problem.
Having solved the trust region problem, one must decide whether to accept the step and/or to
change the trust region radius. The trust region methods that we will discuss in detail approximate
the solution of the trust region problem with the minimizer of the quadratic model along a
piecewise linear path contained in the trust region. Before discussing these specific methods,
we present a special case of a result from [223] on global convergence.
A prototype trust region algorithm, upon which we base the speci¬c instances that follow, is
Algorithm 3.3.1.
Algorithm 3.3.1. trbasic(x, f )
1. Initialize the trust region radius ∆.
2. Do until termination criteria are satisfied
(a) Approximately solve the trust region problem to obtain xt .
(b) Test both the trial point and the trust region radius and decide whether or not to
accept the step, the trust region radius, or both. At least one of x or ∆ will change
in this phase.

Most trust region algorithms differ only in how step 2a in Algorithm trbasic is done.
There are also different ways to implement step 2b, but these differ only in minor details and the
approach we present next in §3.3.1 is very representative.







3.3.1 Changing the Trust Region and the Step
The trust region radius and the new point are usually tested simultaneously. While a notion of
suf¬cient decrease is important, the test is centered on how well the quadratic model approximates
the function inside the trust region. We measure this by comparing the actual reduction in f

ared = f (xc ) ’ f (xt )

with the predicted reduction, i.e., the decrease in the quadratic model

pred = mc (xc ) ’ mc (xt ) = ’∇f (xc )T st ’ sT Hc st /2.
t

pred > 0 for all the trust region algorithms we discuss in this book unless \nabla f(x_c) = 0. We
will introduce three control parameters

\mu_0 \le \mu_{low} < \mu_{high},

which are used to determine if the trial step should be rejected (ared/pred < \mu_0) and/or the trust
region radius should be decreased (ared/pred < \mu_{low}), increased (ared/pred > \mu_{high}), or left
unchanged. Typical values are .25 for \mu_{low} and .75 for \mu_{high}. Both \mu_0 = 10^{-4} and
\mu_0 = \mu_{low} are used. One can also use the sufficient decrease condition (3.4) to determine if the
trial step should be accepted [84].
We will contract and expand the trust region radius by simply multiplying \Delta by constants

0 < \omega_{down} < 1 < \omega_{up}.

Typical values are \omega_{down} = 1/2 and \omega_{up} = 2. There are many other ways to implement a trust
region adjustment algorithm that also give global convergence. For example, the relative error
|pred - ared| / \|\nabla f\| can be used [84] rather than the ratio ared/pred. Finally, we limit the
number of times the trust region radius can be expanded by requiring

\Delta \le C_T \|\nabla f(x_c)\|,
(3.29)

for some C_T > 1, which may depend on x_c. This only serves to eliminate the possibility of
infinite expansion and is used in the proofs. Many of the dogleg methods which we consider
later automatically impose (3.29).
The possibility of expansion is important for efficiency in the case of poor scaling of f.
The convergence theory presented here [162] also uses the expansion phase in the proof of
convergence, but that is not essential. We will present the algorithm to test the trust region in a
manner somewhat different from much of the literature: it returns only once a new iterate has
been accepted.
Algorithm 3.3.2. trtest(x_c, x_t, x_+, f, \Delta)

1. z = x_c

2. Do while z = x_c

(a) ared = f(x_c) - f(x_t), s_t = x_t - x_c, pred = -\nabla f(x_c)^T s_t - s_t^T H_c s_t / 2

(b) If ared/pred < \mu_0 then set z = x_c, \Delta = \omega_{down} \Delta, and solve the trust region
problem with the new radius to obtain a new trial point. If the trust region radius
was just expanded, set z = x_t^{old}.

(c) If \mu_0 \le ared/pred < \mu_{low}, then set z = x_t and \Delta = \omega_{down} \Delta.

(d) If \mu_{low} \le ared/pred \le \mu_{high}, set z = x_t.







(e) If \mu_{high} < ared/pred and \|s_t\| = \Delta \le C_T \|\nabla f(x_c)\|, then set z = x_c,
\Delta = \omega_{up} \Delta, and solve the trust region problem with the new radius to obtain a
new trial point. Store the old trial point as x_t^{old} in case the expansion fails.

3. x_+ = z.

The loop in Algorithm trtest serves the same purpose as the loop in a line search algorithm
such as Algorithm steep. One must design the solution to the trust region problem in such a
way that that loop will terminate after finitely many iterations, and a general way to do that is the
subject of the next section.
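The ratio test that drives Algorithm trtest can be sketched as follows, with the typical parameter values quoted in the text. This is a simplified sketch: it omits the x_t^{old} bookkeeping for failed expansions and the \|s_t\| = \Delta \le C_T \|\nabla f(x_c)\| check that guards expansion.

```python
# One pass of the ared/pred ratio test at the heart of Algorithm trtest,
# using the typical values mu_0 = 1e-4, mu_low = .25, mu_high = .75,
# omega_down = 1/2, omega_up = 2 quoted in the text.
def tr_adjust(ared, pred, delta,
              mu0=1e-4, mu_low=0.25, mu_high=0.75,
              omega_down=0.5, omega_up=2.0):
    # Return (accept, new_delta) for the trial step.
    ratio = ared / pred
    if ratio < mu0:                    # reject the step, shrink the region
        return False, omega_down * delta
    if ratio < mu_low:                 # accept the step, but shrink
        return True, omega_down * delta
    if ratio <= mu_high:               # model adequate: keep the radius
        return True, delta
    return True, omega_up * delta      # model very good: accept and expand

assert tr_adjust(-1.0, 1.0, 1.0) == (False, 0.5)   # f increased: reject
assert tr_adjust(0.5, 1.0, 1.0) == (True, 1.0)     # adequate agreement
assert tr_adjust(0.9, 1.0, 1.0) == (True, 2.0)     # very good: expand
```

In the full algorithm a rejected or expanded radius sends control back to the trust region problem for a new trial point, so the loop only exits once a step has been accepted.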
We incorporate Algorithm trtest into a general trust region algorithm paradigm that we
will use for the remainder of this section.
Algorithm 3.3.3. trgen(x, f )
1. Initialize ∆
2. Do forever
(a) Let xc = x. Compute ∇f (xc ) and an approximate Hessian Hc .
(b) Solve the trust region problem to obtain a trial point xt .
(c) Call trtest(xc , xt , x, f, ∆)

Hessians and gradients are computed only in step 2a of Algorithm trgen.

3.3.2 Global Convergence of Trust Region Algorithms
While one can, in principle, solve the trust region problem exactly (see §3.3.4), it is simpler
and more efficient to solve the problem approximately. It is amazing that one need not do a
very good job with the trust region problem in order to get global (and even locally superlinear)
convergence.
Our demands of our solutions of the trust region problem and our local quadratic models
are modest and readily verifiable. The parameter \sigma in part 1 of Assumption 3.3.1, like the
parameter C_T in (3.29), is used in the analysis but plays no role in implementation. In the
specific algorithms that we discuss in this book, \sigma can be computed. Part 2 follows from
well-conditioned and bounded model Hessians if Algorithm trtest is used to manage the trust
region.
Assumption 3.3.1.

1. There is \sigma > 0 such that

pred = f(x_c) - m_c(x_t) \ge \sigma \|\nabla f(x_c)\| \min(\|s_t\|, \|\nabla f(x_c)\|).
(3.30)

2. There is M > 0 such that either \|s_t\| \ge \|\nabla f(x_c)\| / M or \|s_t\| = \Delta_c.

The global convergence theorem based on this assumption should be compared with the
similar result on line search methods, Theorem 3.2.4.

Theorem 3.3.1. Let \nabla f be Lipschitz continuous with Lipschitz constant L. Let \{x_k\}
be generated by Algorithm trgen and let the solutions for the trust region problems satisfy
Assumption 3.3.1. Assume that the matrices \{H_k\} are bounded. Then either f is unbounded
from below, \nabla f(x_k) = 0 for some finite k, or

\lim_{k \to \infty} \nabla f(x_k) = 0.
(3.31)










Proof. Assume that \nabla f(x_k) \ne 0 for all k and that f is bounded from below. We will show
that there is M_T \in (0, 1] such that once an iterate is taken (i.e., the step is accepted and the trust
region radius is no longer a candidate for expansion), then

\|s_k\| \ge M_T \|\nabla f(x_k)\|.
(3.32)

Assume (3.32) for the present. Since s_k is an acceptable step, Algorithm trtest and part 1 of
Assumption 3.3.1 imply that

ared_k \ge \mu_0 pred_k \ge \mu_0 \sigma \|\nabla f(x_k)\| \min(\|s_k\|, \|\nabla f(x_k)\|).

We may then use (3.32) to obtain

ared_k \ge \mu_0 \sigma M_T \|\nabla f(x_k)\|^2.
(3.33)

Now since \{f(x_k)\} is a decreasing sequence and f is bounded from below, \lim_{k \to \infty} ared_k = 0.
Hence (3.33) implies (3.31).
It remains to prove (3.32). To begin, note that if \|s_k\| < \Delta_k, then by part 2 of Assumption 3.3.1