In the presence of errors in functions and gradients, however, the problem of unconstrained optimization becomes more difficult than that of root finding. We discuss this difference only briefly here and for the remainder of this chapter assume that gradients are computed exactly, or at least as accurately as f, say, either analytically or with automatic differentiation [129], [130]. However, we must carefully study the effects of errors in the evaluation of the Hessian just as we did those of errors in the Jacobian in [154].

A significant difference from the nonlinear equations case arises if only functions are available and gradients and Hessians must be computed with differences. A simple one-dimensional analysis will suffice to explain this. Assume that we can only compute f approximately. If we compute f̂ = f + ε_f rather than f, then a forward difference gradient with difference increment h,

    D_h f(x) = (f̂(x + h) − f̂(x))/h,

differs from f′ by O(h + ε_f/h) and this error is minimized if h = O(√ε_f). In that case the error in the gradient is ε_g = O(h) = O(√ε_f). If a forward difference Hessian is computed using D_h as an approximation to the gradient, then the error in the Hessian will be

(2.13)    ‖Δ‖ = O(√ε_g) = O(ε_f^{1/4})

and the accuracy in ∇²f will be much less than that of a Jacobian in the nonlinear equations case.

If ε_f is significantly larger than machine roundoff, (2.13) indicates that using numerical Hessians based on a second numerical differentiation of the objective function will not be very accurate. Even in the best possible case, where ε_f is the same size as machine roundoff, finite difference Hessians will not be very accurate and will be very expensive to compute if the Hessian is dense. If, as on most computers today, machine roundoff is (roughly) 10^{-16}, (2.13) indicates that a forward difference Hessian will be accurate to roughly four decimal digits.
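
To illustrate (2.13) numerically, here is a minimal sketch (Python; the test function and increments are illustrative choices, not from the text) that differences a forward difference gradient of a smooth one-dimensional function in double precision, where ε_f is roundoff:

    import numpy as np

    eps_f = np.finfo(float).eps         # f is accurate to machine roundoff, ~1e-16
    h_g = np.sqrt(eps_f)                # optimal increment for the difference gradient
    h_H = eps_f ** 0.25                 # optimal increment for the difference Hessian

    f = np.sin                          # hypothetical smooth objective; f''(x) = -sin(x)

    def Dh(x, h=h_g):
        # forward difference approximation to f'(x)
        return (f(x + h) - f(x)) / h

    x = 1.0
    d2f = (Dh(x + h_H) - Dh(x)) / h_H   # difference the difference gradient
    print(abs(d2f + np.sin(x)))         # error ~ 1e-4: about four correct digits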

One can obtain better results with centered differences, but at a cost of twice the number of function evaluations. A centered difference approximation to ∇f is

    D_h f(x) = (f̂(x + h) − f̂(x − h))/(2h)

and the error is O(h² + ε_f/h), which is minimized if h = O(ε_f^{1/3}), leading to an error in the gradient of ε_g = O(ε_f^{2/3}). Therefore, a central difference Hessian will have an error of

    ‖Δ‖ = O(ε_g^{2/3}) = O(ε_f^{4/9}),

which is substantially better. We will find that accurate gradients are much more important than accurate Hessians and one option is to compute gradients with central differences and Hessians with forward differences.


If one does that, the centered difference gradient error is O(ε_f^{2/3}) and therefore the forward difference Hessian error will be

    ‖Δ‖ = O(√ε_g) = O(ε_f^{1/3}).

More elaborate schemes [22] compute a difference gradient and then reuse the function evaluations in the Hessian computation.
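
The gradient accuracies quoted above are easy to check; a short sketch (Python, with an illustrative smooth test function) compares the two choices of increment:

    import numpy as np

    f = lambda x: np.exp(np.sin(x))        # hypothetical smooth objective
    x = 1.0
    exact = np.cos(x) * np.exp(np.sin(x))  # f'(1), computed analytically
    eps_f = np.finfo(float).eps

    h1 = np.sqrt(eps_f)                    # optimal forward difference increment
    h2 = eps_f ** (1.0 / 3.0)              # optimal centered difference increment

    fwd = (f(x + h1) - f(x)) / h1
    ctr = (f(x + h2) - f(x - h2)) / (2 * h2)

    print(abs(fwd - exact))                # ~ eps_f^{1/2} ~ 1e-8
    print(abs(ctr - exact))                # ~ eps_f^{2/3} ~ 1e-11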

In many optimization problems, however, accurate gradients are available. When that is the case, numerical differentiation to compute Hessians is, like numerical computation of Jacobians for nonlinear equations [154], a reasonable idea for many problems, and the less expensive forward differences work well.

Clever implementations of difference computation can exploit sparsity in the Hessian [67], [59] to evaluate a forward difference approximation with far fewer than N evaluations of ∇f. In the sparse case it is also possible [22], [23] to reuse the points from a centered difference approximation to the gradient to create a second-order accurate Hessian.
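
As a concrete instance of the sparse case, a tridiagonal Hessian can be recovered from three gradient differences for any N, because columns j, j + 3, j + 6, . . . have no overlapping nonzero rows. A minimal sketch (Python; the quadratic test problem and the coloring-by-thirds are illustrative, not the algorithms of [67], [59]):

    import numpy as np

    N = 12
    # hypothetical quadratic f(x) = x^T T x / 2 with tridiagonal Hessian T
    T = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    grad = lambda x: T @ x

    x, h = np.random.randn(N), 1e-7
    g0 = grad(x)
    H = np.zeros((N, N))
    for color in range(3):                  # three gradient differences in all
        d = np.zeros(N)
        d[color::3] = 1.0                   # structurally orthogonal columns
        diff = (grad(x + h * d) - g0) / h   # sum of the colored columns of the Hessian
        for j in range(color, N, 3):
            lo, hi = max(j - 1, 0), min(j + 2, N)
            H[lo:hi, j] = diff[lo:hi]       # rows with |i - j| <= 1 belong to column j
    print(np.max(np.abs(H - T)))            # small up to differencing error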

Unless ‖ε_g(x_n)‖ → 0 as the iteration progresses, one cannot expect convergence. For this reason estimates like (2.14) are sometimes called local improvement [88] results. Theorem 2.3.4 is a typical example.

Theorem 2.3.4. Let the standard assumptions hold. Then there are K̄ > 0, δ > 0, and δ_1 > 0 such that if x_c ∈ B(δ) and ‖Δ(x_c)‖ < δ_1 then

    x_+ = x_c − (∇²f(x_c) + Δ(x_c))⁻¹ (∇f(x_c) + ε_g(x_c))

is defined (i.e., ∇²f(x_c) + Δ(x_c) is nonsingular) and satisfies

(2.14)    ‖e_+‖ ≤ K̄(‖e_c‖² + ‖Δ(x_c)‖ ‖e_c‖ + ‖ε_g(x_c)‖).

Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2 hold. Let

    x_+^N = x_c − (∇²f(x_c))⁻¹ ∇f(x_c)

and note that

    x_+ = x_+^N + ((∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹) ∇f(x_c) − (∇²f(x_c) + Δ(x_c))⁻¹ ε_g(x_c).

Lemma 2.3.1 and Theorem 2.3.2 imply

(2.15)    ‖e_+‖ ≤ K‖e_c‖² + 2‖(∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹‖ ‖∇²f(x*)‖ ‖e_c‖ + ‖(∇²f(x_c) + Δ(x_c))⁻¹ ε_g(x_c)‖.

If

    ‖Δ(x_c)‖ ≤ ‖(∇²f(x*))⁻¹‖⁻¹/4,

then Lemma 2.3.1 implies that

    ‖Δ(x_c)‖ ≤ ‖(∇²f(x_c))⁻¹‖⁻¹/2

and the Banach Lemma [12], [154] states that ∇²f(x_c) + Δ(x_c) is nonsingular and

    ‖(∇²f(x_c) + Δ(x_c))⁻¹‖ ≤ 2‖(∇²f(x_c))⁻¹‖ ≤ 4‖(∇²f(x*))⁻¹‖.

Hence,

    ‖(∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹‖ ≤ 8‖(∇²f(x*))⁻¹‖² ‖Δ(x_c)‖.


We use these estimates and (2.15) to obtain

    ‖e_+‖ ≤ K‖e_c‖² + 16‖(∇²f(x*))⁻¹‖² ‖∇²f(x*)‖ ‖Δ(x_c)‖ ‖e_c‖ + 4‖(∇²f(x*))⁻¹‖ ‖ε_g(x_c)‖.

Setting

    K̄ = K + 16‖(∇²f(x*))⁻¹‖² ‖∇²f(x*)‖ + 4‖(∇²f(x*))⁻¹‖

completes the proof.

As is the case with equations, (2.14) implies that one cannot hope to find a minimizer with more accuracy than one can evaluate ∇f. In most cases the iteration will stagnate once ‖e‖ is (roughly) the same size as ε_g. The speed of convergence will be governed by the accuracy in the Hessian.

The result for the chord method illustrates this latter point. In the chord method we form and compute the Cholesky factorization of ∇²f(x_0) and use that factorization to compute all subsequent Newton steps. Hence,

    x_+ = x_c − (∇²f(x_0))⁻¹ ∇f(x_c)

and

(2.16)    ‖Δ(x_c)‖ ≤ γ‖x_0 − x_c‖ ≤ γ(‖e_0‖ + ‖e_c‖).

Algorithmically the chord iteration differs from the Newton iteration only in that the computation and factorization of the Hessian are moved outside of the main loop.

Algorithm 2.3.2. chord(x, f, τ)

1. r_0 = ‖∇f(x)‖
2. Compute ∇²f(x)
3. Factor ∇²f(x) = LLᵀ
4. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a
   (a) Solve LLᵀ s = −∇f(x)
   (b) x = x + s
   (c) Compute ∇f(x).
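
A dense Python transcription of Algorithm chord, assuming callable gradient and Hessian routines (the function names, tolerances, and the strictly convex test problem are illustrative, not from the text):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def chord(x, grad, hess, tau_r=1e-6, tau_a=1e-12, maxit=200):
        g = grad(x)
        r0 = np.linalg.norm(g)                # step 1: r_0 = ||grad f(x)||
        C = cho_factor(hess(x))               # steps 2-3: form and factor once
        for _ in range(maxit):                # step 4, with an iteration cap
            if np.linalg.norm(g) <= tau_r * r0 + tau_a:
                break
            s = cho_solve(C, -g)              # (a): solve L L^T s = -grad f(x)
            x = x + s                         # (b)
            g = grad(x)                       # (c)
        return x

    # illustrative problem: f(x) = sum(x_i^4)/4 + sum(x_i^2)/2, minimizer x = 0
    grad = lambda x: x**3 + x
    hess = lambda x: np.diag(3 * x**2 + 1)
    print(chord(np.full(5, 0.5), grad, hess))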

The convergence theory follows from Theorem 2.3.4 with ε_g = 0 and ‖Δ‖ = O(‖e_0‖).

Theorem 2.3.5. Let the standard assumptions hold. Then there are K_C > 0 and δ > 0 such that if x_0 ∈ B(δ) the chord iterates converge q-linearly to x* and

(2.17)    ‖e_{n+1}‖ ≤ K_C ‖e_0‖ ‖e_n‖.

Proof. Let δ be small enough so that the conclusions of Theorem 2.3.4 hold. Assume that e_n, e_0 ∈ B(δ). Combining (2.16) and (2.14) implies

    ‖e_{n+1}‖ ≤ K̄(‖e_n‖(1 + γ) + γ‖e_0‖)‖e_n‖ ≤ K̄(1 + 2γ)δ‖e_n‖.

Hence, if δ is small enough so that

    K̄(1 + 2γ)δ = η < 1,

then the chord iterates converge q-linearly to x*. Q-linear convergence implies that ‖e_n‖ < ‖e_0‖ and hence (2.17) holds with K_C = K̄(1 + 2γ).


The Shamanskii method [233], [154], [211] is a generalization of the chord method that updates Hessians after every m + 1 nonlinear iterations. Newton's method corresponds to m = 1 and the chord method to m = ∞. The convergence result is a direct consequence of Theorems 2.3.3 and 2.3.5.
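
A sketch of the Shamanskii loop, obtained by modifying the chord code above so that the factorization is refreshed after each block of m steps (my refactoring under that reading of the update schedule; m = 1 recovers Newton's method, and a large m approaches the chord method):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def shamanskii(x, grad, hess, m=2, tau_r=1e-6, tau_a=1e-12, maxit=200):
        g = grad(x)
        r0 = np.linalg.norm(g)
        for _ in range(maxit):
            C = cho_factor(hess(x))           # refresh and refactor the Hessian
            for _ in range(m):                # m steps with the frozen factorization
                if np.linalg.norm(g) <= tau_r * r0 + tau_a:
                    return x
                x = x + cho_solve(C, -g)
                g = grad(x)
        return x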

Theorem 2.3.6. Let the standard assumptions hold and let m ≥ 1 be given. Then there are K_S > 0 and δ > 0 such that if x_0 ∈ B(δ) the Shamanskii iterates converge q-superlinearly to x* with q-order m + 1 and

(2.18)    ‖e_{n+1}‖ ≤ K_S ‖e_n‖^{m+1}.

As one more application of Theorem 2.3.4, we analyze the effects of a difference approximation of the Hessian. We follow the notation of [154] where possible. For example, to construct a Hessian matrix, whose columns are ∇²f(x)e_j, where e_j is the unit vector with jth component 1 and other components 0, we could approximate the matrix–vector products ∇²f(x)e_j by forward differences and then symmetrize the resulting matrix. We define

(2.19)    ∇²_h f(x) = (A_h + A_hᵀ)/2,

where A_h is the matrix whose jth column is D²_h f(x : e_j). D²_h f(x : w), the difference approximation of the action of the Hessian ∇²f(x) on a vector w, is defined to be the quotient

(2.20)    D²_h f(x : w) = 0 if w = 0, and D²_h f(x : w) = (∇f(x + hw/‖w‖) − ∇f(x)) / (h/‖w‖) if w ≠ 0.

Note that we may also write

    D²_h f(x : w) = D_h(∇f)(x : w),

where the notation D_h, taken from [154], denotes the numerical directional derivative. If ‖x‖ is very large, then the error in computing the sum x + hw/‖w‖ will have to be taken into consideration in the choice of h.

We warn the reader, as we did in [154], that D²_h f(x : w) is not a linear map and that D²_h f(x : w) is not, in general, the same as ∇²_h f(x)w.
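
A direct Python transcription of (2.19) and (2.20), assuming an exact gradient routine (the test problem and the value of h are illustrative):

    import numpy as np

    def D2h(gradf, x, w, h):
        # (2.20): difference approximation to the action of the Hessian on w
        nw = np.linalg.norm(w)
        if nw == 0.0:
            return np.zeros_like(x)
        return (gradf(x + h * w / nw) - gradf(x)) / (h / nw)

    def fd_hessian(gradf, x, h=1e-7):
        # (2.19): assemble A_h column by column, then symmetrize
        n = len(x)
        A = np.empty((n, n))
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = 1.0
            A[:, j] = D2h(gradf, x, ej, h)
        return (A + A.T) / 2

    # check on f(x) = (x^T x)^2 / 4: grad f = (x^T x) x, Hessian = (x^T x) I + 2 x x^T
    gradf = lambda x: np.dot(x, x) * x
    x = np.array([1.0, -2.0, 0.5])
    exact = np.dot(x, x) * np.eye(3) + 2.0 * np.outer(x, x)
    print(np.max(np.abs(fd_hessian(gradf, x) - exact)))   # ~ 1e-7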

If we compute ∇f(x) + ε_g(x) and the gradient errors satisfy an estimate of the form

    ‖ε_g(x)‖ ≤ ε̄

for all x, then the computed difference Hessian is ∇_h(∇f + ε_g) and satisfies

(2.21)    ∇²f(x) − ∇_h(∇f + ε_g)(x) = O(h + ε̄/h).

So, as in [154], the choice h ≈ √ε̄ is optimal in the sense that it minimizes the quantity in the O-term in (2.21).

The local convergence theorem in this case is [88], [154], [278], as follows.

Theorem 2.3.7. Let the standard assumptions hold. Then there are δ, ε̄, and K_D > 0 such that if x_c ∈ B(δ), ‖ε_g(x)‖ ≤ ε̄_c for all x ∈ B(δ), and

    h ≥ M‖ε_g(x_c)‖^{1/2}

then

    x_+ = x_c − (∇_h(∇f + ε_g)(x_c))⁻¹ (∇f(x_c) + ε_g(x_c))

satisfies

    ‖e_+‖ ≤ K_D(ε̄_c + (ε̄_c + h)‖e_c‖).


2.3.2 Termination of the Iteration

It is not safe to terminate the iteration when f(x_c) − f(x_+) is small, and no conclusions can safely be drawn by examination of the differences of the objective function values at successive iterations. While some of the algorithms for difficult problems in Part II of this book do indeed terminate when successive function values are close, this is an act of desperation. For example, if

    f(x_n) = − Σ_{j=1}^{n} j⁻¹,

then f(x_n) → −∞ but f(x_{n+1}) − f(x_n) = −1/(n + 1) → 0. The reader has been warned.
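
The warning is easy to reproduce (a Python illustration of this example; the tolerance is arbitrary): a small-decrease test fires long before anything has "converged", since this sequence has no finite limit at all.

    import math

    f, n = 0.0, 0
    while True:
        n += 1
        decrease = 1.0 / n
        f -= decrease              # f(x_n) = -(1 + 1/2 + ... + 1/n) -> -infinity
        if decrease < 1e-6:        # "successive values are close": a false signal
            break
    print(n, f)                    # stops at n = 10^6 with f ~ -14.4, still sinking
    print(-(math.log(n) + 0.5772)) # harmonic-number estimate agrees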

If the standard assumptions hold, then one may terminate the iteration when the norm of ∇f is sufficiently small relative to ‖∇f(x_0)‖ (see [154]). We will summarize the key points here and refer the reader to [154] for the details. The idea is that if ∇²f(x*) is well conditioned, then a small gradient norm implies a small error norm. Hence, for any gradient-based iterative method, termination on small gradients is reasonable. In the special case of Newton's method, the norm of the step is a very good indicator of the error and if one is willing to incur the added cost of an extra iteration, a very sharp bound on the error can be obtained, as we will see below.

Lemma 2.3.8. Assume that the standard assumptions hold. Let δ > 0 be small enough so that the conclusions of Lemma 2.3.1 hold for x ∈ B(δ). Then for all x ∈ B(δ)

(2.22)    ‖e‖ / (4‖e_0‖ κ(∇²f(x*))) ≤ ‖∇f(x)‖ / ‖∇f(x_0)‖ ≤ 4κ(∇²f(x*)) ‖e‖ / ‖e_0‖.

The meaning of (2.22) is that, up to a constant multiplier, the norm of the relative gradient is the same as the norm of the relative error. This partially motivates the termination condition (2.12).
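
In code, the resulting test is the relative/absolute gradient criterion already used in Algorithm chord above (a sketch; the tolerance names are mine):

    import numpy as np

    def grad_converged(g, r0, tau_r=1e-6, tau_a=1e-12):
        # stop when ||grad f(x)|| <= tau_r * ||grad f(x_0)|| + tau_a; the
        # absolute term guards against an x_0 whose gradient is already tiny
        return np.linalg.norm(g) <= tau_r * r0 + tau_a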

In the special case of Newton's method, one can use the steplength as an accurate estimate of the error because Theorem 2.3.2 implies that

(2.23)    ‖e_c‖ = ‖s‖ + O(‖e_c‖²).

Hence, near the solution s and e_c are essentially the same size. The cost of using (2.23) is that all the information needed to compute x_+ = x_c + s has been computed. If we terminate the iteration when ‖s‖ is smaller than our desired tolerance and then take x_+ as the final result, we have attained more accuracy than we asked for. One possibility is to terminate the iteration when ‖s‖ = O(√τ_s) for some τ_s > 0. This, together with (2.23), will imply that ‖e_c‖ = O(√τ_s) and hence, using Theorem 2.3.2, that

(2.24)    ‖e_+‖ = O(‖e_c‖²) = O(τ_s).
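
A sketch of this device for Newton's method (Python; the solver call and tolerance names are mine):

    import numpy as np

    def newton_small_steps(x, grad, hess, tau_s=1e-12, maxit=50):
        for _ in range(maxit):
            s = np.linalg.solve(hess(x), -grad(x))
            x = x + s                              # x_+ = x_c + s is already in hand
            if np.linalg.norm(s) <= np.sqrt(tau_s):
                return x                           # by (2.24), ||e_+|| = O(tau_s)
        return x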

For a superlinearly convergent method, termination on small steps is equally valid but one cannot use (2.24). For a superlinearly convergent method we have

(2.25)    ‖e_c‖ = ‖s‖ + o(‖e_c‖) and ‖e_+‖ = o(‖e_c‖).

Hence, we can conclude that ‖e_+‖ < τ_s if ‖s‖ < τ_s. This is a weaker, but still very useful, result.

For a q-linearly convergent method, such as the chord method, making termination decisions based on the norms of the steps is much riskier. The relative error in estimating ‖e_c‖ by ‖s‖ is

    | ‖e_c‖ − ‖s‖ | / ‖e_c‖ ≤ ‖e_c + s‖ / ‖e_c‖ = ‖e_+‖ / ‖e_c‖.


Hence, estimation of errors by steps is worthwhile only if convergence is fast. One can go further [156] if one has an estimate ρ of the q-factor that satisfies

    ‖e_+‖ ≤ ρ‖e_c‖.

In that case,

    (1 − ρ)‖e_c‖ ≤ ‖e_c‖ − ‖e_+‖ ≤ ‖e_c − e_+‖ = ‖s‖.

Hence

(2.26)    ‖e_+‖ ≤ ρ‖e_c‖ ≤ (ρ/(1 − ρ))‖s‖.

So, if the q-factor can be estimated from above by ρ and

    ‖s‖ < (1 − ρ)τ_s/ρ,

then ‖e_+‖ < τ_s. This approach is used in ODE and DAE codes [32], [234], [228], [213], but requires good estimates of the q-factor and we do not advocate it for q-linearly convergent methods for optimization. The danger is that if the convergence is slow, the approximate q-factor can be a gross underestimate and cause premature termination of the iteration.
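
For completeness only, a sketch of a test based on (2.26), with ρ estimated crudely from successive step norms (all of this is my illustration; as the text warns, an underestimated ρ terminates the iteration too early):

    import numpy as np

    def qfactor_step_test(s_norm, s_prev_norm, tau_s):
        # for a q-linear method ||s_+||/||s_c|| is roughly the q-factor rho
        rho = min(s_norm / max(s_prev_norm, 1e-300), 0.99)   # keep rho < 1
        # (2.26): ||e_+|| <= (rho/(1 - rho)) ||s||, so require
        return s_norm < (1.0 - rho) * tau_s / rho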

It is not uncommon for evaluations of f and ∇f to be very expensive, and optimizations are, therefore, usually allocated a fixed maximum number of iterations. Some algorithms, such as the DIRECT algorithm [150] that we discuss in §8.4.2, assign a limit to the number of function evaluations and terminate the iteration in only this way.

2.4 Nonlinear Least Squares

Nonlinear least squares problems have objective functions of the form

(2.27)    f(x) = (1/2) Σ_{i=1}^{M} r_i(x)² = (1/2) R(x)ᵀ R(x).

The vector R = (r_1, . . . , r_M) is called the residual. These problems arise in data fitting, for example. In that case M is the number of observations and N is the number of parameters; for these problems M > N and we say the problem is overdetermined. If M = N we have a nonlinear equation and the theory and methods from [154] are applicable. If M < N the problem is underdetermined. Overdetermined least squares problems arise most often in data fitting applications like the parameter identification example in §1.6.2. Underdetermined problems are less common, but are, for example, important in the solution of high-index differential algebraic equations [48], [50].

The local convergence theory for underdetermined problems has the additional complexity that the limit of the Gauss–Newton iteration is not uniquely determined by the distance of the initial iterate to the set of points where R(x*) = 0. In §2.4.3 we describe the difficulties and state a simple convergence result.

If x* is a local minimizer of f and f(x*) = 0, the problem min f is called a zero residual problem (a remarkable and suspicious event in the data fitting scenario). If f(x*) is small, the expected result in data fitting if the model (i.e., R) is good, the problem is called a small residual problem. Otherwise one has a large residual problem.

Nonlinear least squares problems are an intermediate stage between nonlinear equations and optimization problems and the methods for their solution reflect this. We define the M × N Jacobian R′ of R by

(2.28)    (R′(x))_{ij} = ∂r_i/∂x_j, 1 ≤ i ≤ M, 1 ≤ j ≤ N.
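
For concreteness, a small sketch of (2.27) and (2.28) for a hypothetical exponential data-fitting residual r_i(x) = x_1 e^{x_2 t_i} − b_i (M = 4 observations, N = 2 parameters), together with the standard identity ∇f(x) = R′(x)ᵀ R(x):

    import numpy as np

    t = np.array([0.0, 0.5, 1.0, 1.5])   # observation times (M = 4)
    b = np.array([1.0, 1.3, 1.7, 2.2])   # hypothetical data

    def R(x):
        # residual: r_i(x) = x_1 * exp(x_2 * t_i) - b_i
        return x[0] * np.exp(x[1] * t) - b

    def Rprime(x):
        # M x N Jacobian (2.28): columns are dr/dx_1 and dr/dx_2
        return np.column_stack([np.exp(x[1] * t),
                                x[0] * t * np.exp(x[1] * t)])

    def f(x):
        # objective (2.27)
        r = R(x)
        return 0.5 * r @ r

    def gradf(x):
        # grad f(x) = R'(x)^T R(x)
        return Rprime(x).T @ R(x)

    x = np.array([1.0, 0.5])
    print(f(x), gradf(x))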
