3.2.3 Damped Gauss–Newton Iteration

As we showed in §2.4, the steepest descent direction for the overdetermined least squares objective

\[ f(x) = \frac{1}{2} \sum_{i=1}^{M} r_i(x)^2 = \frac{1}{2} R(x)^T R(x) \]

is

\[ -\nabla f(x) = -R'(x)^T R(x). \]

The steepest descent algorithm could be applied to nonlinear least squares problems with the good global performance and poor local convergence that we expect.

The Gauss–Newton direction at x,

\[ d_{GS} = -(R'(x)^T R'(x))^{-1} R'(x)^T R(x), \]

is not defined if R'(x) fails to have full column rank. If R'(x) does have full column rank, then

\[ (d_{GS})^T \nabla f(x) = -(R'(x)^T R(x))^T (R'(x)^T R'(x))^{-1} R'(x)^T R(x) < 0, \]

and the Gauss–Newton direction is a descent direction. The combination of the Armijo rule with the Gauss–Newton direction is called damped Gauss–Newton iteration.
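As an illustration, a minimal damped Gauss–Newton iteration might be sketched as follows. The function names, the gradient tolerance, the Armijo parameter 10⁻⁴, and the steplength floor are illustrative assumptions, not taken from the text:

```python
import numpy as np

def damped_gauss_newton(R, J, x, max_iter=100, tol=1e-10):
    """Damped Gauss-Newton: Gauss-Newton direction plus Armijo backtracking.

    R : callable returning the residual vector R(x), length M
    J : callable returning the M x N Jacobian R'(x)
    """
    for _ in range(max_iter):
        r, Jx = R(x), J(x)
        g = Jx.T @ r                              # gradient of f = R^T R / 2
        if np.linalg.norm(g) < tol:
            break
        # Gauss-Newton direction; requires R'(x) to have full column rank
        d = -np.linalg.solve(Jx.T @ Jx, g)
        lam, f0 = 1.0, 0.5 * (r @ r)
        # Armijo rule: halve the steplength until sufficient decrease holds
        while (0.5 * np.linalg.norm(R(x + lam * d)) ** 2
               > f0 + 1e-4 * lam * (g @ d)) and lam > 1e-12:
            lam /= 2.0
        x = x + lam * d
    return x
```

On a zero residual problem this sketch exhibits the expected behavior: global progress from the Armijo rule and fast local convergence from the Gauss–Newton direction.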

A problem with the damped Gauss–Newton algorithm is that, in order for Theorem 3.2.4 to be applicable, the matrices {R'(x_k)^T R'(x_k)} must not only have full column rank but also must be uniformly bounded and well conditioned, which are very strong assumptions (but if they are satisfied, damped Gauss–Newton is a very effective algorithm).

The Levenberg–Marquardt method [172], [183] addresses these issues by adding a regularization parameter ν > 0 to R'(x_c)^T R'(x_c) to obtain x_+ = x_c + s, where

\[ s = -(\nu_c I + R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c), \tag{3.20} \]

where I is the N × N identity matrix. The matrix ν_c I + R'(x_c)^T R'(x_c) is positive definite. The parameter ν is called the Levenberg–Marquardt parameter.

It is not necessary to compute R'(x_c)^T R'(x_c) to compute a Levenberg–Marquardt step [76]. One can also solve the full-rank (M + N) × N linear least squares problem

\[ \min_s \frac{1}{2} \left\| \begin{pmatrix} R'(x_c) \\ \sqrt{\nu_c}\, I \end{pmatrix} s + \begin{pmatrix} R(x_c) \\ 0 \end{pmatrix} \right\|_2^2 \tag{3.21} \]

to compute s (see exercise 3.5.6). Compare this with computing the undamped Gauss–Newton step by solving (2.33).
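To make the comparison concrete, here is a sketch in NumPy (the function names and the test data are illustrative) of computing the step from the augmented problem (3.21) with a least squares solve, next to the normal-equations formula (3.20):

```python
import numpy as np

def lm_step(Jc, Rc, nu):
    """Levenberg-Marquardt step via the augmented (M+N) x N least squares
    problem (3.21); R'(xc)^T R'(xc) is never formed explicitly."""
    M, N = Jc.shape
    A = np.vstack([Jc, np.sqrt(nu) * np.eye(N)])   # [ R'(xc) ; sqrt(nu) I ]
    b = -np.concatenate([Rc, np.zeros(N)])         # [ -R(xc) ; 0 ]
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    return s

def lm_step_normal(Jc, Rc, nu):
    """The same step from the normal equations (3.20), for comparison only;
    forming J^T J squares the condition number."""
    N = Jc.shape[1]
    return -np.linalg.solve(nu * np.eye(N) + Jc.T @ Jc, Jc.T @ Rc)
```

On well-conditioned data the two agree to rounding error; the augmented form is the one to prefer when R'(x_c) is ill conditioned.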

If one couples the Levenberg–Marquardt method with the Armijo rule, then Theorem 3.2.4 is applicable far from a minimizer and Theorem 2.4.1 nearby. We ask the reader to provide the details of the proof in exercise 3.5.7.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

48 ITERATIVE METHODS FOR OPTIMIZATION

Theorem 3.2.7. Let R' be Lipschitz continuous. Let x_k be the Levenberg–Marquardt–Armijo iterates. Assume that ‖R'(x_k)‖ is uniformly bounded and that the sequence of Levenberg–Marquardt parameters {ν_k} is such that

\[ \kappa(\nu_k I + R'(x_k)^T R'(x_k)) \]

is bounded. Then

\[ \lim_{k \to \infty} \| R'(x_k)^T R(x_k) \| = 0. \]

Moreover, if x* is any limit point of {x_k} at which R(x*) = 0, Assumption 2.4.1 holds, and ν_k → 0, then x_k → x* q-superlinearly. If, moreover,

\[ \nu_k = O(\| R(x_k) \|) \]

as k → ∞, then the convergence is q-quadratic.

For example, if κ(R'(x_k)^T R'(x_k)) and ‖R'(x_k)‖ are bounded, then ν_k = min(1, ‖R(x_k)‖) would satisfy the assumptions of Theorem 3.2.7. For a zero residual problem, this addresses the potential conditioning problems of the damped Gauss–Newton method and still gives quadratic convergence in the terminal phase of the iteration. The Levenberg–Marquardt–Armijo iteration will also converge, albeit slowly, for a large residual problem.

We will not discuss globally convergent methods for underdetermined least squares problems in this book. We refer the reader to [24], [252], and [253] for discussion of underdetermined problems.

3.2.4 Nonlinear Conjugate Gradient Methods

Operationally, the conjugate gradient iteration for a quadratic problem updates the current iteration with a linear combination of the current residual r and a search direction p. The search direction is itself a linear combination of previous residuals. Only r and p need be stored to continue the iteration. The methods discussed in this section seek to extend this idea to more general nonlinear problems.

Nonlinear conjugate gradient algorithms have the significant advantage of low storage over most of the other algorithms covered in this book, the method of steepest descent being the exception. For problems so large that the Newton or quasi-Newton methods cannot be implemented using the available storage, these methods are among the few options (see [177] and [5] for examples).

Linear conjugate gradient seeks to minimize f(x) = x^T Hx/2 − x^T b. The residual r = b − Hx is simply −∇f(x), leading to a natural extension to nonlinear problems in which r_0 = p_0 = −∇f(x_0) and, for k ≥ 1,

\[ r_k = -\nabla f(x_k) \quad \text{and} \quad p_k = r_k + \beta_k p_{k-1}. \tag{3.22} \]

The update of x,

\[ x_{k+1} = x_k + \alpha_k p_k, \]

can be done with a simple analytic minimization in the quadratic case, but a line search will be needed in the nonlinear case. The missing pieces, therefore, are the choice of β_k, the way the line search will be done, and convergence theory. Theory is needed, for example, to decide if p_k is a descent direction for all k.

The general form of the algorithm is very simple. The inputs are an initial iterate, which will be overwritten by the solution, the function to be minimized, and a termination vector τ = (τ_r, τ_a) of relative and absolute residuals.

Algorithm 3.2.2. nlcg(x, f, τ)


1. r_0 = ‖∇f(x)‖; k = 0

2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a

   (a) If k = 0 then p = −∇f(x) else p = −∇f(x) + βp

   (b) x = x + αp

The two most common choices for β, both of which are equivalent to the linear CG formula in the quadratic case, are the Fletcher–Reeves [106]

\[ \beta_k^{FR} = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2} \tag{3.23} \]

and Polak–Ribière [215], [216]

\[ \beta_k^{PR} = \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k-1}))}{\|\nabla f(x_{k-1})\|^2} \tag{3.24} \]

formulas. The Fletcher–Reeves method has been observed to take long sequences of very small steps and virtually stagnate [112], [207], [208], [226]. The Polak–Ribière formula performs much better and is more commonly used but has a less satisfactory convergence theory.
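A compact sketch of Algorithm nlcg with both β formulas might read as follows. The restart safeguard, the simple backtracking line search standing in for the Wolfe search the theory requires, and all parameter values are illustrative assumptions, not part of the text:

```python
import numpy as np

def nlcg(x, f, grad, tau_r=1e-6, tau_a=1e-9, beta_rule="PR", max_iter=500):
    """Nonlinear CG following Algorithm nlcg, with the Fletcher-Reeves (3.23)
    or Polak-Ribiere (3.24) beta and a backtracking line search."""
    g = grad(x)
    r0 = np.linalg.norm(g)
    p = -g
    k = 0
    while np.linalg.norm(g) > tau_r * r0 + tau_a and k < max_iter:
        # sufficient-decrease backtracking only; convergence theory wants Wolfe
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha /= 2.0
        x = x + alpha * p
        g_new = grad(x)
        if beta_rule == "FR":
            beta = (g_new @ g_new) / (g @ g)                 # (3.23)
        else:
            beta = max(g_new @ (g_new - g) / (g @ g), 0.0)   # PR+, (3.24)
        p = -g_new + beta * p
        if g_new @ p >= 0:        # safeguard: restart with steepest descent
            p = -g_new
        g = g_new
        k += 1
    return x
```

On a quadratic the iteration reduces to linear CG (up to the inexact line search); the restart is one simple way to guarantee p stays a descent direction, a point the theory below treats more carefully.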

The line search has more stringent requirements, at least for proofs of convergence, than are satisfied by the Armijo method that we advocate for steepest descent. We require that the steplength parameter satisfies the Wolfe conditions [272], [273]

\[ f(x_k + \alpha_k p_k) \le f(x_k) + \sigma_\alpha \alpha_k \nabla f(x_k)^T p_k \tag{3.25} \]

and

\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge \sigma_\beta \nabla f(x_k)^T p_k, \tag{3.26} \]

where 0 < σ_α < σ_β < 1. The first of the Wolfe conditions (3.25) is the sufficient decrease condition, (3.4), that all line search algorithms must satisfy. The second (3.26) is often, but not always, implied by the Armijo backtracking scheme of alternating a test for sufficient decrease and reduction of the steplength. One can design a line search method that will, under modest assumptions, find a steplength satisfying the Wolfe conditions [104], [171], [193].
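For instance, one can check (3.25) and (3.26) directly for a given steplength. The function name and the sample values σ_α = 10⁻⁴, σ_β = 0.9 below are illustrative; any 0 < σ_α < σ_β < 1 qualifies:

```python
import numpy as np

def wolfe_ok(f, grad, x, p, alpha, sigma_a=1e-4, sigma_b=0.9):
    """Return True iff steplength alpha satisfies the Wolfe conditions
    (3.25) (sufficient decrease) and (3.26) (curvature) along p."""
    gTp = grad(x) @ p
    decrease = f(x + alpha * p) <= f(x) + sigma_a * alpha * gTp
    curvature = grad(x + alpha * p) @ p >= sigma_b * gTp
    return bool(decrease and curvature)
```

For f(x) = ‖x‖²/2 from x = (1) along p = (−1), the exact minimizer α = 1 passes both conditions, while a tiny step α = 0.01 satisfies sufficient decrease but fails the curvature condition (3.26), which is exactly the kind of short step (3.26) is there to rule out.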

The convergence result [3] for the Fletcher–Reeves formula requires a bit more. The proof that p_k is a descent direction requires the strong Wolfe conditions, which replace (3.26) by

\[ |\nabla f(x_k + \alpha_k p_k)^T p_k| \le -\sigma_\beta \nabla f(x_k)^T p_k \tag{3.27} \]

and demand that 0 < σ_α < σ_β < 1/2. The algorithm from [193], for example, will find a point satisfying the strong Wolfe conditions.

Theorem 3.2.8. Assume that the set

\[ N = \{x \mid f(x) \le f(x_0)\} \]

is bounded and that f is Lipschitz continuously differentiable in a neighborhood of N. Let Algorithm nlcg be implemented with the Fletcher–Reeves formula and a line search that satisfies the strong Wolfe conditions. Then

\[ \lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \]


This result has been generalized to allow for any choice of β_k such that |β_k| ≤ β_k^{FR} [112]. A similar result for the Polak–Ribière method, but with more complex conditions on the line search, has been proved in [134]. This complexity in the line search is probably necessary, as there are examples where reasonable line searches lead to failure in the Polak–Ribière method [222]. One can also prove convergence if β_k^{PR} is replaced by max(β_k^{PR}, 0) [112].

There is continuing research on these methods and we point to [112], [134], [205], and [202] as good sources.

3.3 Trust Region Methods

Trust region methods overcome the problems that line search methods encounter with non-spd approximate Hessians. In particular, a Newton trust region strategy allows the use of complete Hessian information even in regions where the Hessian has negative curvature. The specific trust region methods we will present effect a smooth transition from the steepest descent direction to the Newton direction in a way that gives the global convergence properties of steepest descent and the fast local convergence of Newton's method.

The idea is very simple. We let ∆ be the radius of the ball about x_c in which the quadratic model

\[ m_c(x) = f(x_c) + \nabla f(x_c)^T (x - x_c) + (x - x_c)^T H_c (x - x_c)/2 \]

can be trusted to accurately represent the function. ∆ is called the trust region radius and the ball

\[ T(\Delta) = \{x \mid \|x - x_c\| \le \Delta\} \]

is called the trust region.

We compute the new point x_+ by (approximately) minimizing m_c over T(∆). The trust region problem for doing that is usually posed in terms of the difference s_t between x_c and the minimizer of m_c in the trust region:

\[ \min_{\|s\| \le \Delta} m_c(x_c + s). \tag{3.28} \]

We will refer to either the trial step s_t or the trial solution x_t = x_c + s_t as the solution to the trust region problem.

Having solved the trust region problem, one must decide whether to accept the step and/or to change the trust region radius. The trust region methods that we will discuss in detail approximate the solution of the trust region problem with the minimizer of the quadratic model along a piecewise linear path contained in the trust region. Before discussing these specific methods, we present a special case of a result from [223] on global convergence.

A prototype trust region algorithm, upon which we base the specific instances that follow, is Algorithm 3.3.1.

Algorithm 3.3.1. trbasic(x, f)

1. Initialize the trust region radius ∆.

2. Do until termination criteria are satisfied

   (a) Approximately solve the trust region problem to obtain x_t.

   (b) Test both the trial point and the trust region radius and decide whether or not to accept the step, the trust region radius, or both. At least one of x or ∆ will change in this phase.

Most trust region algorithms differ only in how step 2a in Algorithm trbasic is done. There are also different ways to implement step 2b, but these differ only in minor details and the approach we present next in §3.3.1 is very representative.


3.3.1 Changing the Trust Region and the Step

The trust region radius and the new point are usually tested simultaneously. While a notion of sufficient decrease is important, the test is centered on how well the quadratic model approximates the function inside the trust region. We measure this by comparing the actual reduction in f,

\[ ared = f(x_c) - f(x_t), \]

with the predicted reduction, i.e., the decrease in the quadratic model,

\[ pred = m_c(x_c) - m_c(x_t) = -\nabla f(x_c)^T s_t - s_t^T H_c s_t / 2. \]

We have pred > 0 for all the trust region algorithms we discuss in this book unless ∇f(x_c) = 0. We will introduce three control parameters

\[ \mu_0 \le \mu_{low} < \mu_{high}, \]

which are used to determine if the trial step should be rejected (ared/pred < μ_0) and/or the trust region radius should be decreased (ared/pred < μ_low), increased (ared/pred > μ_high), or left unchanged. Typical values are .25 for μ_low and .75 for μ_high. Both μ_0 = 10^{-4} and μ_0 = μ_low are used. One can also use the sufficient decrease condition (3.4) to determine if the trial step should be accepted [84].

We will contract and expand the trust region radius by simply multiplying ∆ by constants

\[ 0 < \omega_{down} < 1 < \omega_{up}. \]

Typical values are ω_down = 1/2 and ω_up = 2. There are many other ways to implement a trust region adjustment algorithm that also give global convergence. For example, the relative error |pred − ared|/‖∇f‖ can be used [84] rather than the ratio ared/pred. Finally, we limit the number of times the trust region radius can be expanded by requiring

\[ \Delta \le C_T \|\nabla f(x_c)\|, \tag{3.29} \]

for some C_T > 1, which may depend on x_c. This only serves to eliminate the possibility of infinite expansion and is used in the proofs. Many of the dogleg methods which we consider later automatically impose (3.29).

The possibility of expansion is important for efficiency in the case of poor scaling of f. The convergence theory presented here [162] also uses the expansion phase in the proof of convergence, but that is not essential. We will present the algorithm to test the trust region in a manner, somewhat different from much of the literature, that returns only once a new iterate has been accepted.

Algorithm 3.3.2. trtest(x_c, x_t, x_+, f, ∆)

1. z = x_c

2. Do while z = x_c

   (a) ared = f(x_c) − f(x_t), s_t = x_t − x_c, pred = −∇f(x_c)^T s_t − s_t^T H_c s_t / 2

   (b) If ared/pred < μ_0 then set z = x_c, ∆ = ω_down ∆, and solve the trust region problem with the new radius to obtain a new trial point. If the trust region radius was just expanded, set z = x_t^{old}.

   (c) If μ_0 ≤ ared/pred < μ_low, then set z = x_t and ∆ = ω_down ∆.

   (d) If μ_low ≤ ared/pred ≤ μ_high, set z = x_t.


   (e) If μ_high < ared/pred and ‖s_t‖ = ∆ ≤ C_T ‖∇f(x_c)‖, then set z = x_c, ∆ = ω_up ∆, and solve the trust region problem with the new radius to obtain a new trial point. Store the old trial point as x_t^{old} in case the expansion fails.

3. x_+ = z.
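The control logic of steps 2b-2e can be condensed into a single decision function. The sketch below is a simplified variant, not Algorithm trtest itself: it accepts a very successful step and enlarges ∆ for the next outer iteration rather than re-solving inside the loop, and it omits the expansion limit (3.29). The function name is ours; the default parameters are the typical values quoted above:

```python
def tr_adjust(ratio, delta, mu_0=1e-4, mu_low=0.25, mu_high=0.75,
              omega_down=0.5, omega_up=2.0):
    """Decide step acceptance and the new trust region radius from
    ratio = ared/pred. Returns (accept, new_delta)."""
    if ratio < mu_0:                    # poor model: reject and shrink
        return False, omega_down * delta
    if ratio < mu_low:                  # marginal: accept but shrink
        return True, omega_down * delta
    if ratio <= mu_high:                # good: accept, keep radius
        return True, delta
    return True, omega_up * delta       # excellent: accept and expand
```

A rejected step (ratio < μ_0) would be followed by re-solving the trust region problem with the smaller radius, exactly as in step 2b.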

The loop in Algorithm trtest serves the same purpose as the loop in a line search algorithm such as Algorithm steep. One must design the solution to the trust region problem in such a way that the loop will terminate after finitely many iterations; a general way to do that is the subject of the next section.

We incorporate Algorithm trtest into a general trust region algorithm paradigm that we will use for the remainder of this section.

Algorithm 3.3.3. trgen(x, f)

1. Initialize ∆.

2. Do forever

   (a) Let x_c = x. Compute ∇f(x_c) and an approximate Hessian H_c.

   (b) Solve the trust region problem to obtain a trial point x_t.

   (c) Call trtest(x_c, x_t, x, f, ∆).

Hessians and gradients are computed only in step 2a of Algorithm trgen.

3.3.2 Global Convergence of Trust Region Algorithms

While one can, in principle, solve the trust region problem exactly (see §3.3.4), it is simpler and more efficient to solve the problem approximately. It is amazing that one need not do a very good job with the trust region problem in order to get global (and even locally superlinear) convergence.

Our demands of our solutions of the trust region problem and our local quadratic models are modest and readily verifiable. The parameter σ in part 1 of Assumption 3.3.1, like the parameter C_T in (3.29), is used in the analysis but plays no role in implementation. In the specific algorithms that we discuss in this book, σ can be computed. Part 2 follows from well-conditioned and bounded model Hessians if Algorithm trtest is used to manage the trust region.

Assumption 3.3.1.

1. There is σ > 0 such that

\[ pred = f(x_c) - m_c(x_t) \ge \sigma \|\nabla f(x_c)\| \min(\|s_t\|, \|\nabla f(x_c)\|). \tag{3.30} \]

2. There is M > 0 such that either ‖s_t‖ ≥ ‖∇f(x_c)‖/M or ‖s_t‖ = ∆_c.

The global convergence theorem based on this assumption should be compared with the similar result on line search methods, Theorem 3.2.4.

Theorem 3.3.1. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be generated by Algorithm trgen and let the solutions for the trust region problems satisfy Assumption 3.3.1. Assume that the matrices {H_k} are bounded. Then either f is unbounded from below, ∇f(x_k) = 0 for some finite k, or

\[ \lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{3.31} \]


Proof. Assume that ∇f(x_k) ≠ 0 for all k and that f is bounded from below. We will show that there is M_T ∈ (0, 1] such that once an iterate is taken (i.e., the step is accepted and the trust region radius is no longer a candidate for expansion), then

\[ \|s_k\| \ge M_T \|\nabla f(x_k)\|. \tag{3.32} \]

Assume (3.32) for the present. Since s_k is an acceptable step, Algorithm trtest and part 1 of Assumption 3.3.1 imply that

\[ ared_k \ge \mu_0\, pred_k \ge \mu_0 \sigma \|\nabla f(x_k)\| \min(\|s_k\|, \|\nabla f(x_k)\|). \]

We may then use (3.32) to obtain

\[ ared_k \ge \mu_0 \sigma M_T \|\nabla f(x_k)\|^2. \tag{3.33} \]

Now since f(x_k) is a decreasing sequence and f is bounded from below, lim_{k→∞} ared_k = 0. Hence (3.33) implies (3.31).

It remains to prove (3.32). To begin, note that if ‖s_k‖ < ∆_k then by part 2 of Assumption 3.3.1